2022-01-20

  • cs.LG updates on arXiv.org

    On the Expressivity of Markov Reward. (arXiv:2111.00876v2 [cs.LG] UPDATED)
    (2 min) Reward is the driving force for reinforcement-learning agents. This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of "task" that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to optimize tasks of each of these three types, and correctly determine when no such reward function exists. We conclude with an empirical study that corroborates and illustrates our theoretical findings.
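    A minimal sketch of the flavor of these constructions for task type (3), assuming a toy finite MDP: a linear program searches for a Markov reward whose trajectory returns respect a given strict ordering, and LP infeasibility certifies that no such reward exists. This is an illustration in the spirit of the paper, not its exact algorithm.

        # Sketch: find a Markov reward r(s, a) whose trajectory returns respect a
        # given strict ordering, or report that none exists (LP infeasibility).
        import numpy as np
        from scipy.optimize import linprog

        n_states, n_actions = 2, 2          # toy MDP; r is indexed by (s, a)
        n_vars = n_states * n_actions

        def feature_counts(trajectory):
            """Count visits to each (s, a) pair, so return(tau) = counts @ r."""
            c = np.zeros(n_vars)
            for s, a in trajectory:
                c[s * n_actions + a] += 1
            return c

        # Trajectories as lists of (state, action) pairs.
        tau_good = [(0, 0), (1, 0)]
        tau_bad = [(0, 1), (1, 1)]
        ordering = [(tau_good, tau_bad)]    # tau_good must earn strictly higher return

        # For each ordered pair (hi, lo): counts(lo) @ r - counts(hi) @ r <= -margin.
        margin = 1.0
        A_ub = np.array([feature_counts(lo) - feature_counts(hi) for hi, lo in ordering])
        b_ub = np.full(len(ordering), -margin)

        res = linprog(c=np.zeros(n_vars), A_ub=A_ub, b_ub=b_ub,
                      bounds=[(-10, 10)] * n_vars, method="highs")
        if res.success:
            print("Markov reward realizing the ordering:", res.x.round(2))
        else:
            print("No Markov reward function can express this trajectory ordering.")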
    On the Optimality of the Oja's Algorithm for Online PCA. (arXiv:2104.00512v2 [cs.LG] UPDATED)
    (2 min) In this paper we analyze the behavior of Oja's algorithm for online/streaming principal component subspace estimation. We prove that, with high probability, it achieves an efficient, gap-free, global convergence rate when approximating a principal component subspace for any sub-Gaussian distribution. Moreover, we show for the first time that this convergence rate, namely the upper bound on the approximation error, matches the lower bound attained by offline/classical PCA up to a constant factor.
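    A minimal NumPy sketch of Oja's update for streaming estimation of a k-dimensional principal subspace; the step size, stream length, and synthetic sub-Gaussian data are illustrative choices, not the paper's experimental setup.

        # Oja's algorithm for streaming k-dimensional principal subspace estimation.
        import numpy as np

        rng = np.random.default_rng(0)
        d, k, eta = 10, 2, 0.05

        # Synthetic stream whose top-2 directions dominate.
        cov = np.diag([5.0, 4.0] + [0.1] * (d - 2))
        Q, _ = np.linalg.qr(rng.standard_normal((d, k)))   # random orthonormal init

        for t in range(5000):
            x = rng.multivariate_normal(np.zeros(d), cov)
            Q += eta * np.outer(x, x @ Q)                  # Oja update: Q += eta * x x^T Q
            Q, _ = np.linalg.qr(Q)                         # re-orthonormalize

        # Subspace error vs. the true top-k eigenspace (first k coordinates here).
        P_true = np.zeros((d, d)); P_true[:k, :k] = np.eye(k)
        err = np.linalg.norm(Q @ Q.T - P_true, ord=2)
        print(f"spectral subspace error: {err:.3f}")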
    Deep Unified Representation for Heterogeneous Recommendation. (arXiv:2201.05861v1 [cs.IR])
    (2 min) Recommender systems have been widely studied in both academia and industry. Previous works mainly focus on homogeneous recommendation, and little progress has been made for heterogeneous recommender systems. However, heterogeneous recommendations, e.g., recommending different types of items including products, videos, celebrity shopping notes, among many others, are dominant nowadays. State-of-the-art methods are incapable of leveraging attributes from different types of items and thus suffer from data sparsity problems, and it is indeed quite challenging to represent items with different feature spaces jointly. To tackle this problem, we propose a kernel-based neural network, namely deep unified representation (or DURation) for heterogeneous recommendation, to jointly model unified representations of heterogeneous items while preserving the topology of their original feature spaces. Theoretically, we prove the representation ability of the proposed model. Besides, we conduct extensive experiments on real-world datasets. Experimental results demonstrate that with the unified representation, our model achieves remarkable improvement (e.g., a 4.1%-34.9% lift in AUC and a 3.7% lift in online CTR) over existing state-of-the-art models.
    Black-Box Testing of Deep Neural Networks through Test Case Diversity. (arXiv:2112.12591v2 [cs.SE] UPDATED)
    (2 min) Deep Neural Networks (DNNs) have been extensively used in many areas including image processing, medical diagnostics, and autonomous driving. However, DNNs can exhibit erroneous behaviours that may lead to critical errors, especially when used in safety-critical systems. Inspired by testing techniques for traditional software systems, researchers have proposed neuron coverage criteria, as an analogy to source code coverage, to guide the testing of DNN models. Despite very active research on DNN coverage, several recent studies have questioned the usefulness of such criteria in guiding DNN testing. Further, from a practical standpoint, these criteria are white-box as they require access to the internals or training data of DNN models, which is in many contexts not feasible or convenient. In this paper, we investigate black-box input diversity metrics as an alternative to white-box coverage criteria. To this end, we first select and adapt three diversity metrics and study, in a controlled manner, their capacity to measure actual diversity in input sets. We then analyse their statistical association with fault detection using two datasets and three DNN models. We further compare diversity with state-of-the-art white-box coverage criteria. Our experiments show that relying on the diversity of image features embedded in test input sets is a more reliable indicator than coverage criteria to effectively guide the testing of DNNs. Indeed, we found that one of our selected black-box diversity metrics far outperforms existing coverage criteria in terms of fault-revealing capability and computational time. Results also confirm the suspicions that state-of-the-art coverage metrics are not adequate to guide the construction of test input sets to detect as many faults as possible with natural inputs.
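    One plausible instance of such a black-box metric, in the spirit of the paper's geometric diversity: score a test set by the log-volume of a similarity kernel over its feature embeddings. The RBF kernel and random stand-in features are assumptions for illustration, not necessarily the paper's exact choices.

        # Sketch of a "geometric diversity" score for a test input set: the
        # log-determinant of a similarity kernel over per-input feature vectors.
        import numpy as np

        def geometric_diversity(features, gamma=1.0, jitter=1e-6):
            """features: (n, d) array of per-input embeddings (e.g., from a CNN)."""
            sq = np.sum(features**2, axis=1)
            d2 = sq[:, None] + sq[None, :] - 2 * features @ features.T
            K = np.exp(-gamma * d2) + jitter * np.eye(len(features))
            sign, logdet = np.linalg.slogdet(K)
            return logdet  # higher = more diverse input set

        rng = np.random.default_rng(0)
        tight = rng.normal(0, 0.1, size=(20, 64))    # near-duplicate inputs
        spread = rng.normal(0, 2.0, size=(20, 64))   # diverse inputs
        print(geometric_diversity(tight), "<", geometric_diversity(spread))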
    Deep Learning strategies for ProtoDUNE raw data denoising. (arXiv:2103.01596v2 [hep-ph] UPDATED)
    (2 min) In this work, we investigate different machine learning-based strategies for denoising raw simulation data from the ProtoDUNE experiment. The ProtoDUNE detector is hosted by CERN and aims to test and calibrate the technologies for DUNE, a forthcoming experiment in neutrino physics. The reconstruction workflow consists of converting digital detector signals into physical high-level quantities. We address the first step in reconstruction, namely raw data denoising, leveraging deep learning algorithms. We design two architectures based on graph neural networks, aiming to enhance the receptive field of basic convolutional neural networks. We benchmark this approach against traditional algorithms implemented by the DUNE collaboration. We test the capabilities of graph neural network hardware accelerator setups to speed up training and inference processes.
    Targeted VAE: Variational and Targeted Learning for Causal Inference. (arXiv:2009.13472v5 [stat.ML] UPDATED)
    (2 min) Undertaking causal inference with observational data is incredibly useful across a wide range of tasks, including the development of medical treatments, advertising and marketing, and policy making. There are two significant challenges associated with undertaking causal inference using observational data: treatment assignment heterogeneity (i.e., differences between the treated and untreated groups), and an absence of counterfactual data (i.e., not knowing what would have happened had a treated individual instead not been treated). We address these two challenges by combining structured inference and targeted learning. In terms of structure, we factorize the joint distribution into risk, confounding, instrumental, and miscellaneous factors; in terms of targeted learning, we apply a regularizer derived from the influence curve in order to reduce residual bias. An ablation study is undertaken, and an evaluation on benchmark datasets demonstrates that TVAE has competitive, state-of-the-art performance.
    Faster Rates of Private Stochastic Convex Optimization. (arXiv:2108.00331v3 [cs.LG] UPDATED)
    (2 min) In this paper, we revisit the problem of Differentially Private Stochastic Convex Optimization (DP-SCO) and provide excess population risk bounds for some special classes of functions that are faster than the previous results for general convex and strongly convex functions. In the first part of the paper, we study the case where the population risk function satisfies the Tsybakov Noise Condition (TNC) with some parameter $\theta>1$. Specifically, we first show that under some mild assumptions on the loss functions, there is an algorithm whose output could achieve an upper bound of $\tilde{O}((\frac{1}{\sqrt{n}}+\frac{\sqrt{d\log \frac{1}{\delta}}}{n\epsilon})^\frac{\theta}{\theta-1})$ for $(\epsilon, \delta)$-DP when $\theta\geq 2$, where $n$ is the sample size and $d$ is the dimension of the space. Then we address the inefficiency issue, improve the upper bounds by $\text{Poly}(\log n)$ factors, and extend to the case where $\theta\geq \bar{\theta}>1$ for some known $\bar{\theta}$. Next we show that the excess population risk of population functions satisfying TNC with parameter $\theta\geq 2$ is always lower bounded by $\Omega((\frac{d}{n\epsilon})^\frac{\theta}{\theta-1})$ and $\Omega((\frac{\sqrt{d\log \frac{1}{\delta}}}{n\epsilon})^\frac{\theta}{\theta-1})$ for $\epsilon$-DP and $(\epsilon, \delta)$-DP, respectively. In the second part, we focus on a special case where the population risk function is strongly convex. Unlike previous studies, here we assume the loss function is {\em non-negative} and {\em the optimal value of the population risk is sufficiently small}. With these additional assumptions, we propose a new method whose output could achieve an upper bound of $O(\frac{d\log\frac{1}{\delta}}{n^2\epsilon^2}+\frac{1}{n^{\tau}})$ for any $\tau\geq 1$ in the $(\epsilon,\delta)$-DP model if the sample size $n$ is sufficiently large.
    Multi-View representation learning in Multi-Task Scene. (arXiv:2201.05829v1 [cs.CV])
    (2 min) Recent decades have witnessed considerable progress in both multi-task learning and multi-view learning, but the setting that considers both simultaneously has received little attention. How to use the multi-view latent representations of each task to improve performance across tasks remains a challenging problem. To this end, we propose a novel semi-supervised algorithm, termed Multi-Task Multi-View learning based on Common and Special Features (MTMVCSF). In general, multiple views are different aspects of an object, and each view contains underlying common or special information about that object. Accordingly, for each learning task we mine a joint latent factor across views, consisting of each view's special features together with the common features shared by all views. In this way, the original multi-task multi-view data degenerates into multi-task data, and exploiting the correlations among multiple tasks improves the performance of the learning algorithm. A further advantage of this approach is that we obtain latent representations for unlabeled instances through a regression task constrained by labeled instances; classification and semi-supervised clustering on these latent representations perform markedly better than on the raw data. Furthermore, we propose an anti-noise variant, AN-MTMVCSF, which is strongly robust to noisy labels. The effectiveness of these algorithms is demonstrated by a series of well-designed experiments on both real-world and synthetic data.
    Spatio-Temporal meets Wavelet: Disentangled Traffic Flow Forecasting via Efficient Spectral Graph Attention Network. (arXiv:2112.02740v2 [cs.LG] UPDATED)
    (2 min) Traffic forecasting is crucial for public safety and resource optimization, yet it is very challenging due to three aspects: i) existing works mostly exploit intricate temporal patterns (e.g., short-term thunderstorms and long-term daily trends) within a single method, which fails to accurately capture spatio-temporal dependencies under different schemas; ii) the under-exploration of graph positional encoding limits the extraction of spatial information in the commonly used full graph attention network; iii) the quadratic complexity of full graph attention introduces heavy computational demands. To achieve effective traffic flow forecasting, we propose an efficient spectral graph attention network with disentangled traffic sequences. Specifically, the discrete wavelet transform is leveraged to obtain the low- and high-frequency components of traffic sequences, and a dual-channel encoder is elaborately designed to accurately capture the spatio-temporal dependencies under long- and short-term schemas of the low- and high-frequency components. Moreover, a novel wavelet-based graph positional encoding and a query sampling strategy are introduced in our spectral graph attention to effectively guide message passing and efficiently calculate the attention. Extensive experiments on four real-world datasets show the superiority of our model, i.e., higher traffic forecasting precision at lower computational cost.
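    A minimal sketch of the disentangling step using PyWavelets; the wavelet family ('db4') and the synthetic traffic series are illustrative assumptions.

        # Split a traffic sequence into low- and high-frequency components with a
        # discrete wavelet transform; each channel would feed one encoder branch.
        import numpy as np
        import pywt

        t = np.arange(288)                                  # one day at 5-min resolution
        daily_trend = 50 + 20 * np.sin(2 * np.pi * t / 288) # long-term pattern
        burst = 15 * (np.abs(t - 150) < 5)                  # short-term incident
        flow = daily_trend + burst + np.random.default_rng(0).normal(0, 2, t.size)

        cA, cD = pywt.dwt(flow, "db4")                      # approximation / detail coeffs

        # Reconstruct each channel separately for the dual-channel encoder.
        low_freq = pywt.idwt(cA, None, "db4")[: t.size]     # long-term schema input
        high_freq = pywt.idwt(None, cD, "db4")[: t.size]    # short-term schema input
        print(low_freq.shape, high_freq.shape)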
    Inspecting state of the art performance and NLP metrics in image-based medical report generation. (arXiv:2011.09257v3 [cs.CL] UPDATED)
    (2 min) Several deep learning architectures have been proposed over the last years to deal with the problem of generating a written report given an imaging exam as input. Most works evaluate the generated reports using standard Natural Language Processing (NLP) metrics (e.g. BLEU, ROUGE), reporting significant progress. In this article, we contrast this progress by comparing state of the art (SOTA) models against weak baselines. We show that simple and even naive approaches yield near SOTA performance on most traditional NLP metrics. We conclude that evaluation methods in this task should be further studied towards correctly measuring clinical accuracy, ideally involving physicians to contribute to this end.
    Improving Adversarial Robustness via Channel-wise Activation Suppressing. (arXiv:2103.08307v2 [cs.LG] UPDATED)
    (2 min) The study of adversarial examples and their activations has attracted significant attention for secure and robust learning with deep neural networks (DNNs). Different from existing works, in this paper we highlight two new characteristics of adversarial examples from the channel-wise activation perspective: 1) the activation magnitudes of adversarial examples are higher than those of natural examples; and 2) the channels are activated more uniformly by adversarial examples than by natural examples. We find that the state-of-the-art defense, adversarial training, has addressed the first issue of high activation magnitudes via training on adversarial examples, while the second issue of uniform activation remains. This motivates us to suppress redundant activations triggered by adversarial perturbations via a Channel-wise Activation Suppressing (CAS) strategy. We show that CAS can train a model that inherently suppresses adversarial activation, and can be easily applied to existing defense methods to further improve their robustness. Our work provides a simple but generic training strategy for robustifying the intermediate-layer activations of DNNs.
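    A small PyTorch sketch of how the two channel-wise characteristics can be measured on an intermediate feature map; the random tensors stand in for natural and adversarial activations, which would come from a real model and attack.

        import torch

        def channel_stats(features: torch.Tensor):
            """features: (N, C, H, W) intermediate activations of some layer."""
            mags = features.abs().mean(dim=(2, 3))          # per-channel magnitude, (N, C)
            probs = mags / mags.sum(dim=1, keepdim=True)    # distribution over channels
            entropy = -(probs * (probs + 1e-12).log()).sum(dim=1)
            return mags.mean().item(), entropy.mean().item()

        # Stand-ins for activations produced by natural and adversarial batches;
        # in practice these come from a forward hook and an attack such as PGD.
        nat = torch.rand(8, 64, 16, 16)
        adv = torch.rand(8, 64, 16, 16) * 1.5

        for name, f in [("natural", nat), ("adversarial", adv)]:
            mag, ent = channel_stats(f)
            # Higher magnitude and higher entropy (more uniform channel usage)
            # are the adversarial signatures that CAS aims to suppress.
            print(f"{name}: mean magnitude={mag:.3f}, channel entropy={ent:.3f}")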
    SMACE: A New Method for the Interpretability of Composite Decision Systems. (arXiv:2111.08749v2 [cs.LG] UPDATED)
    (2 min) Interpretability is a pressing issue for decision systems. Many post hoc methods have been proposed to explain the predictions of a single machine learning model. However, business processes and decision systems are rarely centered around a unique model. These systems combine multiple models that produce key predictions, and then apply decision rules to generate the final decision. To explain such decisions, we propose the Semi-Model-Agnostic Contextual Explainer (SMACE), a new interpretability method that combines a geometric approach for decision rules with existing solutions for machine learning models to generate an intuitive feature ranking tailored to the end user. We show that established model-agnostic approaches produce poor results on tabular data in this setting, in particular giving the same importance to several features, whereas SMACE can rank them in a meaningful way.
    Unsupervised MRI Reconstruction via Zero-Shot Learned Adversarial Transformers. (arXiv:2105.08059v3 [eess.IV] UPDATED)
    (2 min) Supervised reconstruction models are characteristically trained on matched pairs of undersampled and fully-sampled data to capture an MRI prior, along with supervision regarding the imaging operator to enforce data consistency. To reduce supervision requirements, the recent deep image prior framework instead conjoins untrained MRI priors with the imaging operator during inference. Yet, canonical convolutional architectures are suboptimal in capturing long-range relationships, and priors based on randomly initialized networks may yield suboptimal performance. To address these limitations, here we introduce a novel unsupervised MRI reconstruction method based on zero-Shot Learned Adversarial TransformERs (SLATER). SLATER embodies a deep adversarial network with cross-attention transformers to map noise and latent variables onto coil-combined MR images. During pre-training, this unconditional network learns a high-quality MRI prior in an unsupervised generative modeling task. During inference, a zero-shot reconstruction is then performed by incorporating the imaging operator and optimizing the prior to maximize consistency to undersampled data. Comprehensive experiments on brain MRI datasets clearly demonstrate the superior performance of SLATER against state-of-the-art unsupervised methods.
    Deep Indexed Active Learning for Matching Heterogeneous Entity Representations. (arXiv:2104.03986v2 [cs.DB] UPDATED)
    (2 min) Given two large lists of records, the task in entity resolution (ER) is to find the pairs from the Cartesian product of the lists that correspond to the same real world entity. Typically, passive learning methods on such tasks require large amounts of labeled data to yield useful models. Active Learning is a promising approach for ER in low resource settings. However, the search space, to find informative samples for the user to label, grows quadratically for instance-pair tasks making active learning hard to scale. Previous works, in this setting, rely on hand-crafted predicates, pre-trained language model embeddings, or rule learning to prune away unlikely pairs from the Cartesian product. This blocking step can miss out on important regions in the product space leading to low recall. We propose DIAL, a scalable active learning approach that jointly learns embeddings to maximize recall for blocking and accuracy for matching blocked pairs. DIAL uses an Index-By-Committee framework, where each committee member learns representations based on powerful pre-trained transformer language models. We highlight surprising differences between the matcher and the blocker in the creation of the training data and the objective used to train their parameters. Experiments on five benchmark datasets and a multilingual record matching dataset show the effectiveness of our approach in terms of precision, recall and running time. Code is available at https://github.com/ArjitJ/DIAL
    Meta-repository of screening mammography classifiers. (arXiv:2108.04800v3 [cs.LG] UPDATED)
    (2 min) Artificial intelligence (AI) is showing promise in improving clinical diagnosis. In breast cancer screening, recent studies show that AI has the potential to improve early cancer diagnosis and reduce unnecessary workup. As the number of proposed models and their complexity grows, it is becoming increasingly difficult to re-implement them. To enable reproducibility of research and to enable comparison between different methods, we release a meta-repository containing models for classification of screening mammograms. This meta-repository creates a framework that enables the evaluation of AI models on any screening mammography data set. At its inception, our meta-repository contains five state-of-the-art models with open-source implementations and cross-platform compatibility. We compare their performance on seven international data sets. Our framework has a flexible design that can be generalized to other medical image analysis tasks. The meta-repository is available at https://www.github.com/nyukat/mammography_metarepository.
    Particle Dual Averaging: Optimization of Mean Field Neural Networks with Global Convergence Rate Analysis. (arXiv:2012.15477v3 [stat.ML] UPDATED)
    (2 min) We propose the particle dual averaging (PDA) method, which generalizes the dual averaging method in convex optimization to optimization over probability distributions with a quantitative runtime guarantee. The algorithm consists of an inner loop and an outer loop: the inner loop utilizes the Langevin algorithm to approximately solve for a stationary distribution, which is then optimized in the outer loop. The method can thus be interpreted as an extension of the Langevin algorithm to naturally handle nonlinear functionals on the probability space. An important application of the proposed method is the optimization of neural networks in the mean field regime, which is theoretically attractive due to the presence of nonlinear feature learning, but for which quantitative convergence rates can be challenging to obtain. By adapting finite-dimensional convex optimization theory to the space of measures, we analyze PDA in regularized empirical / expected risk minimization, and establish quantitative global convergence in learning two-layer mean field neural networks under more general settings. Our theoretical results are supported by numerical simulations on neural networks of reasonable size.
    HiRID-ICU-Benchmark -- A Comprehensive Machine Learning Benchmark on High-resolution ICU Data. (arXiv:2111.08536v4 [cs.LG] UPDATED)
    (2 min) The recent success of machine learning methods applied to time series collected from Intensive Care Units (ICU) exposes the lack of standardized machine learning benchmarks for developing and comparing such methods. While raw datasets, such as MIMIC-IV or eICU, can be freely accessed on Physionet, tasks and pre-processing are often chosen ad hoc for each publication, limiting comparability across publications. In this work, we aim to improve this situation by providing a benchmark covering a large spectrum of ICU-related tasks. Using the HiRID dataset, we define multiple clinically relevant tasks in collaboration with clinicians. In addition, we provide a reproducible end-to-end pipeline to construct both data and labels. Finally, we provide an in-depth analysis of current state-of-the-art sequence modeling methods, highlighting some limitations of deep learning approaches for this type of data. With this benchmark, we hope to give the research community the possibility of a fair comparison of their work.
    On Designing Good Representation Learning Models. (arXiv:2107.05948v3 [cs.LG] UPDATED)
    (2 min) Because the goal of representation learning differs from the ultimate objective of machine learning, such as decision making, it is very difficult to establish clear and direct objectives for training representation learning models. It has been argued that a good representation should disentangle the underlying variation factors, yet how to translate this into training objectives remains unknown. This paper presents an attempt to establish direct training criteria and design principles for developing good representation learning models. We propose that a good representation learning model should be maximally expressive, i.e., capable of distinguishing the maximum number of input configurations. We formally define expressiveness and introduce the maximum expressiveness (MEXS) theorem of a general learning model. We propose to train a model by maximizing its expressiveness while at the same time incorporating general priors such as model smoothness. We present a conscience competitive learning algorithm which encourages the model to reach its MEXS while adhering to the model-smoothness prior. We also introduce a label consistent training (LCT) technique to boost model smoothness by encouraging it to assign consistent labels to similar samples. We present extensive experimental results to show that our method can indeed design representation learning models capable of developing representations that are as good as or better than the state of the art. We also show that our technique is computationally efficient, robust against different parameter settings, and can work effectively on a variety of datasets. Code available at https://github.com/qlilx/odgrlm.git
    A Machine Learning and Computer Vision Approach to Rapidly Optimize Multiscale Droplet Generation. (arXiv:2105.13553v5 [cs.LG] UPDATED)
    (2 min) Generating droplets from a continuous stream of fluid requires precise tuning of a device to find optimized control parameter conditions. It is analytically intractable to compute the necessary control parameter values of a droplet-generating device that produces optimized droplets. Furthermore, as the length scale of the fluid flow changes, the formation physics and the optimized conditions that induce flow decomposition into droplets also change. Hence, a single proportional-integral-derivative controller is too inflexible to optimize devices of different length scales or different control parameters, while classification machine learning techniques take days to train and require millions of droplet images. Therefore, the question is posed: can a single method be created that universally optimizes multiple length-scale droplets using only a few data points and is faster than previous approaches? In this paper, a Bayesian optimization and computer vision feedback loop is designed to quickly and reliably discover the control parameter values that generate optimized droplets within different length-scale devices. This method is demonstrated to converge on optimum parameter values using 60 images in only 2.3 hours, 30x faster than previous approaches. Model implementation is demonstrated for two different length-scale devices: a milliscale inkjet device and a microfluidics device.
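    A minimal sketch of the feedback loop with scikit-optimize, assuming a placeholder droplet_quality function in place of the real image-acquisition and computer-vision scoring step.

        # Bayesian optimization proposes control parameters, a vision routine
        # scores the resulting droplet images, and the score feeds back in.
        import numpy as np
        from skopt import gp_minimize

        def droplet_quality(params):
            """Placeholder: acquire an image at (pressure, flow_rate) and return a
            loss, e.g., deviation from a target droplet diameter measured by vision."""
            pressure, flow_rate = params
            return (pressure - 3.2) ** 2 + (flow_rate - 0.8) ** 2 + np.random.normal(0, 0.01)

        result = gp_minimize(
            droplet_quality,
            dimensions=[(0.5, 10.0), (0.1, 2.0)],   # pressure, flow rate bounds
            n_calls=30,                             # few experiments, per the paper's spirit
            random_state=0,
        )
        print("best parameters:", result.x, "best loss:", result.fun)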
    Untargeted Poisoning Attack Detection in Federated Learning via Behavior Attestation. (arXiv:2101.10904v3 [cs.CR] UPDATED)
    (2 min) Federated Learning (FL) is a paradigm in Machine Learning (ML) that addresses data privacy, security, access rights, and access to heterogeneous information by training a global model using distributed nodes. Despite its advantages, there is an increased potential for cyberattacks on FL-based ML techniques that can undermine these benefits. Model-poisoning attacks on FL target the availability of the model; the adversarial objective is to disrupt the training. We propose attestedFL, a defense mechanism that monitors the training of individual nodes through state persistence in order to detect a malicious worker. A fine-grained assessment of the worker's history permits the evaluation of its behavior over time and results in innovative detection strategies. We present three lines of defense that assess whether the worker is reliable, by observing whether the node is really training and advancing towards a goal. Our defense exposes an attacker's malicious behavior and removes unreliable nodes from the aggregation process so that the FL process converges faster. Through extensive evaluations against various adversarial settings, attestedFL increased the accuracy of the model by between 12% and 58% under different scenarios, such as attacks performed at different stages of convergence, colluding attackers, and continuous attacks.
    Robot Learning from Randomized Simulations: A Review. (arXiv:2111.00956v2 [cs.RO] UPDATED)
    (2 min) The rise of deep learning has caused a paradigm shift in robotics research, favoring methods that require large amounts of data. Unfortunately, it is prohibitively expensive to generate such data sets on a physical platform. Therefore, state-of-the-art approaches learn in simulation where data generation is fast as well as inexpensive and subsequently transfer the knowledge to the real robot (sim-to-real). Despite becoming increasingly realistic, all simulators are by construction based on models, hence inevitably imperfect. This raises the question of how simulators can be modified to facilitate learning robot control policies and overcome the mismatch between simulation and reality, often called the 'reality gap'. We provide a comprehensive review of sim-to-real research for robotics, focusing on a technique named 'domain randomization' which is a method for learning from randomized simulations.
    GNMR: A provable one-line algorithm for low rank matrix recovery. (arXiv:2106.12933v2 [math.OC] UPDATED)
    (2 min) Low rank matrix recovery problems, including matrix completion and matrix sensing, appear in a broad range of applications. In this work we present GNMR -- an extremely simple iterative algorithm for low rank matrix recovery, based on a Gauss-Newton linearization. On the theoretical front, we derive recovery guarantees for GNMR in both the matrix sensing and matrix completion settings. Some of these results improve upon the best currently known for other methods. A key property of GNMR is that it implicitly keeps the factor matrices approximately balanced throughout its iterations. On the empirical front, we show that for matrix completion with uniform sampling, GNMR performs better than several popular methods, especially when given very few observations close to the information limit.
    Reinforcement Learning for Ridesharing: An Extended Survey. (arXiv:2105.01099v3 [cs.LG] UPDATED)
    (2 min) In this paper, we present a comprehensive, in-depth survey of the literature on reinforcement learning approaches to decision optimization problems in a typical ridesharing system. Papers on the topics of rideshare matching, vehicle repositioning, ride-pooling, routing, and dynamic pricing are covered. Popular data sets and open simulation environments are also introduced. Subsequently, we discuss a number of challenges and opportunities for reinforcement learning research on this important domain.
    How To Train Your Program: a Probabilistic Programming Pattern for Bayesian Learning From Data. (arXiv:2105.03650v2 [cs.LG] UPDATED)
    (2 min) We present a Bayesian approach to machine learning with probabilistic programs. In our approach, training on available data is implemented as inference on a hierarchical model. The posterior distribution of model parameters is then used to stochastically condition a complementary model, such that inference on new data yields the same posterior distribution of latent parameters corresponding to the new data as inference on a hierarchical model on the combination of both previously available and new data, at a lower computation cost. We frame the approach as a design pattern of probabilistic programming, referred to herein as 'stump and fungus', and evaluate realizations of the pattern in case studies.
    Credit Assignment in Neural Networks through Deep Feedback Control. (arXiv:2106.07887v3 [cs.LG] UPDATED)
    (2 min) The success of deep learning sparked interest in whether the brain learns by using similar techniques for assigning credit to each synaptic weight for its contribution to the network output. However, the majority of current attempts at biologically-plausible learning methods are either non-local in time, require highly specific connectivity motifs, or have no clear link to any known mathematical optimization method. Here, we introduce Deep Feedback Control (DFC), a new learning method that uses a feedback controller to drive a deep neural network to match a desired output target and whose control signal can be used for credit assignment. The resulting learning rule is fully local in space and time and approximates Gauss-Newton optimization for a wide range of feedback connectivity patterns. To further underline its biological plausibility, we relate DFC to a multi-compartment model of cortical pyramidal neurons with a local voltage-dependent synaptic plasticity rule, consistent with recent theories of dendritic processing. By combining dynamical system theory with mathematical optimization theory, we provide a strong theoretical foundation for DFC that we corroborate with detailed results on toy experiments and standard computer-vision benchmarks.
    Iterative Causal Discovery in the Possible Presence of Latent Confounders and Selection Bias. (arXiv:2111.04095v2 [cs.LG] UPDATED)
    (2 min) We present a sound and complete algorithm, called iterative causal discovery (ICD), for recovering causal graphs in the presence of latent confounders and selection bias. ICD relies on the causal Markov and faithfulness assumptions and recovers the equivalence class of the underlying causal graph. It starts with a complete graph, and consists of a single iterative stage that gradually refines this graph by identifying conditional independence (CI) between connected nodes. Independence and causal relations entailed after any iteration are correct, rendering ICD anytime. Essentially, we tie the size of the CI conditioning set to its distance on the graph from the tested nodes, and increase this value in the successive iteration. Thus, each iteration refines a graph that was recovered by previous iterations having smaller conditioning sets -- a higher statistical power -- which contributes to stability. We demonstrate empirically that ICD requires significantly fewer CI tests and learns more accurate causal graphs compared to FCI, FCI+, and RFCI algorithms (code is available at https://github.com/IntelLabs/causality-lab).
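    A simplified sketch of the iterative refinement idea: starting from a complete graph, edges are removed whenever a conditional-independence test passes, with the conditioning-set size growing across iterations. This PC-style simplification omits ICD's distance-based conditioning sets and its handling of latent confounders and selection bias.

        import numpy as np
        from itertools import combinations
        from scipy import stats

        def ci_test(data, i, j, cond, alpha=0.05):
            """Fisher-z partial correlation test of X_i independent of X_j given X_cond."""
            idx = [i, j] + list(cond)
            prec = np.linalg.pinv(np.corrcoef(data[:, idx], rowvar=False))
            r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
            z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(len(data) - len(cond) - 3)
            return 2 * (1 - stats.norm.cdf(abs(z))) > alpha   # True => independent

        def skeleton(data, max_order=2):
            d = data.shape[1]
            adj = {i: set(range(d)) - {i} for i in range(d)}
            for r in range(max_order + 1):                    # growing conditioning sets
                for i in range(d):
                    for j in list(adj[i]):
                        if j < i:
                            continue
                        for cond in combinations(adj[i] - {j}, r):
                            if ci_test(data, i, j, cond):
                                adj[i].discard(j); adj[j].discard(i)
                                break
            return adj

        rng = np.random.default_rng(0)
        x = rng.normal(size=1000); y = x + rng.normal(size=1000); z = y + rng.normal(size=1000)
        print(skeleton(np.column_stack([x, y, z])))   # expect x - y - z, no x - z edge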
    On Centralized and Distributed Mirror Descent: Convergence Analysis Using Quadratic Constraints. (arXiv:2105.14385v2 [math.OC] UPDATED)
    (2 min) Mirror descent (MD) is a powerful first-order optimization technique that subsumes several optimization algorithms including gradient descent (GD). In this work, we develop a semi-definite programming (SDP) framework to analyze the convergence rate of MD in centralized and distributed settings under both strongly convex and non-strongly convex assumptions. We view MD through a dynamical systems lens and leverage quadratic constraints (QCs) to provide explicit convergence rates based on Lyapunov stability. For centralized MD under the strongly convex assumption, we develop an SDP that certifies exponential convergence rates. We prove that the SDP always has a feasible solution that recovers the optimal GD rate as a special case. We complement our analysis by providing the $O(1/k)$ convergence rate for convex problems. Next, we analyze the convergence of distributed MD and characterize the rate using SDP. To the best of our knowledge, the numerical rate of distributed MD has not been previously reported in the literature. We further prove an $O(1/k)$ convergence rate for distributed MD in the convex setting. Our numerical experiments on strongly convex problems indicate that our framework certifies superior convergence rates compared to the existing rates for distributed GD.
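    A concrete instance for readers unfamiliar with MD: with the negative-entropy mirror map on the probability simplex, mirror descent becomes the exponentiated-gradient update. The toy quadratic objective and step size below are illustrative.

        import numpy as np

        A = np.array([[2.0, 0.5], [0.5, 1.0]])
        b = np.array([1.0, 0.2])
        grad = lambda x: A @ x - b            # gradient of f(x) = 0.5 x^T A x - b^T x

        x = np.ones(2) / 2                    # start at the simplex center
        eta = 0.1
        for k in range(200):
            w = x * np.exp(-eta * grad(x))    # mirror step in the dual space
            x = w / w.sum()                   # Bregman projection back onto the simplex

        print("simplex-constrained minimizer:", x.round(4))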
    How Modular Should Neural Module Networks Be for Systematic Generalization?. (arXiv:2106.08170v2 [cs.LG] UPDATED)
    (2 min) Neural Module Networks (NMNs) aim at Visual Question Answering (VQA) via the composition of modules that each tackle a sub-task. NMNs are a promising strategy to achieve systematic generalization, i.e., overcoming biasing factors in the training distribution. However, the aspects of NMNs that facilitate systematic generalization are not fully understood. In this paper, we demonstrate that the degree of modularity of the NMN has a large influence on systematic generalization. In a series of experiments on three VQA datasets (VQA-MNIST, SQOOP, and CLEVR-CoGenT), our results reveal that tuning the degree of modularity, especially at the image encoder stage, yields substantially better systematic generalization. These findings lead to new NMN architectures that outperform previous ones in terms of systematic generalization.
    Bandit Data-Driven Optimization. (arXiv:2008.11707v2 [cs.LG] UPDATED)
    (2 min) Applications of machine learning in the non-profit and public sectors often feature an iterative workflow of data acquisition, prediction, and optimization of interventions. There are four major pain points that a machine learning pipeline must overcome in order to be actually useful in these settings: small data, data collected only under the default intervention, unmodeled objectives due to communication gaps, and unforeseen consequences of the intervention. In this paper, we introduce bandit data-driven optimization, the first iterative prediction-prescription framework to address these pain points. Bandit data-driven optimization combines the advantages of online bandit learning and offline predictive analytics in an integrated framework. We propose PROOF, a novel algorithm for this framework, and formally prove that it is no-regret. Using numerical simulations, we show that PROOF achieves superior performance to existing baselines. We also apply PROOF in a detailed case study of food rescue volunteer recommendation, and show that PROOF as a framework works well with the intricacies of ML models in real-world AI for non-profit and public sector applications.
    MutFormer: A context-dependent transformer-based model to predict pathogenic missense mutations. (arXiv:2110.14746v2 [q-bio.GN] UPDATED)
    (0 min) A missense mutation is a point mutation that results in a substitution of an amino acid in a protein sequence. Currently, missense mutations account for over half of the known variants responsible for Mendelian diseases, but accurate prediction of the pathogenicity of missense variants remains challenging. Recent advances show transformer models - a type of deep neural network - to be particularly powerful at modeling sequence information with context dependence, such as language translation tasks. In this study, we introduce MutFormer, a transformer-based model for the prediction of pathogenic missense mutations, using reference and mutated amino acid sequences from the human genome as the features. We first pre-trained MutFormer on reference protein sequences and mutated protein sequences resulting from common genetic variants (assumed to be benign). We next tested different fine-tuning methods for pathogenicity prediction, and determined that utilizing paired-protein sequences (concatenation of the wildtype and mutated protein sequence) yields the best results. Our results show that MutFormer outperforms a variety of existing tools in pathogenicity prediction, including those using conventional machine-learning approaches (e.g., support vector machine, convolutional neural network, recurrent neural network, etc.), and may complement existing computational predictions and empirically generated functional scores to improve our understanding of disease variants. MutFormer and pre-computed variant scores are publicly available on GitHub at https://github.com/WGLab/mutformer.
    Disentangling Transfer and Interference in Multi-Domain Learning. (arXiv:2107.05445v4 [cs.CV] UPDATED)
    (2 min) Humans are incredibly good at transferring knowledge from one domain to another, enabling rapid learning of new tasks. Likewise, transfer learning has enabled enormous success in many computer vision problems using pretraining. However, the benefits of transfer in multi-domain learning, where a network learns multiple tasks defined by different datasets, have not been adequately studied. Learning multiple domains could be beneficial, or these domains could interfere with each other given limited network capacity. Understanding how deep neural networks of varied capacity facilitate transfer across inputs from different distributions is a critical step towards open world learning. In this work, we decipher the conditions where interference and knowledge transfer occur in multi-domain learning. We propose new metrics disentangling interference and transfer, set up experimental protocols, and examine the roles of network capacity, task grouping, and dynamic loss weighting in reducing interference and facilitating transfer.
    FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling. (arXiv:2110.08263v2 [cs.LG] UPDATED)
    (2 min) The recently proposed FixMatch achieved state-of-the-art results on most semi-supervised learning (SSL) benchmarks. However, like other modern SSL algorithms, FixMatch uses a pre-defined constant threshold for all classes to select unlabeled data that contribute to the training, thus failing to consider the different learning status and learning difficulties of different classes. To address this issue, we propose Curriculum Pseudo Labeling (CPL), a curriculum learning approach to leverage unlabeled data according to the model's learning status. The core of CPL is to flexibly adjust the thresholds for different classes at each time step so that informative unlabeled data and their pseudo labels can pass through. CPL does not introduce additional parameters or computations (forward or backward propagation). We apply CPL to FixMatch and call our improved algorithm FlexMatch. FlexMatch achieves state-of-the-art performance on a variety of SSL benchmarks, with especially strong performances when the labeled data are extremely limited or when the task is challenging. For example, FlexMatch achieves 13.96% and 18.96% error rate reduction over FixMatch on the CIFAR-100 and STL-10 datasets respectively, when there are only 4 labels per class. CPL also significantly boosts the convergence speed, e.g., FlexMatch can use only 1/5 of the training time of FixMatch to achieve even better performance. Furthermore, we show that CPL can be easily adapted to other SSL algorithms and remarkably improve their performances. We open-source our code at https://github.com/TorchSSL/TorchSSL.
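    A simplified sketch of the CPL threshold rule in its basic form (per-class thresholds scaled by estimated learning status); the warm-up and non-linear mapping variants discussed in the paper are omitted.

        import torch

        def flexible_thresholds(unlabeled_probs: torch.Tensor, tau: float = 0.95):
            """unlabeled_probs: (N, C) softmax outputs on the unlabeled set."""
            conf, pred = unlabeled_probs.max(dim=1)
            num_classes = unlabeled_probs.shape[1]
            # sigma(c): confident predictions accumulated per class (learning status).
            sigma = torch.stack([((pred == c) & (conf > tau)).sum()
                                 for c in range(num_classes)]).float()
            beta = sigma / sigma.max().clamp(min=1.0)   # normalized learning status
            return beta * tau                           # per-class threshold T(c)

        def pseudo_label_mask(unlabeled_probs, thresholds):
            conf, pred = unlabeled_probs.max(dim=1)
            return conf > thresholds[pred], pred        # pass mask and pseudo labels

        probs = torch.softmax(torch.randn(1000, 10) * 4, dim=1)
        T = flexible_thresholds(probs)
        mask, labels = pseudo_label_mask(probs, T)
        print("per-class thresholds:", T)
        print("pseudo-labeled samples:", mask.sum().item(), "of", len(probs))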
    Provably Efficient Multi-Task Reinforcement Learning with Model Transfer. (arXiv:2107.08622v2 [cs.LG] UPDATED)
    (2 min) We study multi-task reinforcement learning (RL) in tabular episodic Markov decision processes (MDPs). We formulate a heterogeneous multi-player RL problem, in which a group of players concurrently face similar but not necessarily identical MDPs, with a goal of improving their collective performance through inter-player information sharing. We design and analyze an algorithm based on the idea of model transfer, and provide gap-dependent and gap-independent upper and lower bounds that characterize the intrinsic complexity of the problem.
    Learning Optimal Predictive Checklists. (arXiv:2112.01020v2 [cs.LG] UPDATED)
    (2 min) Checklists are simple decision aids that are often used to promote safety and reliability in clinical applications. In this paper, we present a method to learn checklists for clinical decision support. We represent predictive checklists as discrete linear classifiers with binary features and unit weights. We then learn globally optimal predictive checklists from data by solving an integer programming problem. Our method allows users to customize checklists to obey complex constraints, including constraints to enforce group fairness and to binarize real-valued features at training time. In addition, it pairs models with an optimality gap that can inform model development and determine the feasibility of learning sufficiently accurate checklists on a given dataset. We pair our method with specialized techniques that speed up its ability to train a predictive checklist that performs well and has a small optimality gap. We benchmark the performance of our method on seven clinical classification problems, and demonstrate its practical benefits by training a short-form checklist for PTSD screening. Our results show that our method can fit simple predictive checklists that perform well and that can easily be customized to obey a rich class of custom constraints.
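    A minimal sketch of the underlying integer program using PuLP: choose at most K unit-weight binary features and a threshold M, minimizing training errors via big-M slack variables. The paper's fairness constraints and feature binarization are omitted, and the toy data is synthetic.

        import numpy as np
        import pulp

        rng = np.random.default_rng(0)
        n, d, K = 60, 6, 3
        X = rng.integers(0, 2, size=(n, d))                  # binary features
        y = (X[:, 0] + X[:, 1] + X[:, 2] >= 2).astype(int)   # ground-truth checklist

        prob = pulp.LpProblem("checklist", pulp.LpMinimize)
        w = [pulp.LpVariable(f"w{j}", cat="Binary") for j in range(d)]
        z = [pulp.LpVariable(f"z{i}", cat="Binary") for i in range(n)]   # error flags
        M = pulp.LpVariable("M", lowBound=1, upBound=K, cat="Integer")

        prob += pulp.lpSum(z)                                 # minimize training errors
        prob += pulp.lpSum(w) <= K
        B = d + 1                                             # big-M constant
        for i in range(n):
            score = pulp.lpSum(int(X[i, j]) * w[j] for j in range(d))
            if y[i] == 1:
                prob += score >= M - B * z[i]                 # positives must reach M
            else:
                prob += score <= M - 1 + B * z[i]             # negatives must fall short

        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        print("items:", [j for j in range(d) if w[j].value() == 1],
              "threshold:", M.value(), "errors:", sum(v.value() for v in z))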
    Long-Short Ensemble Network for Bipolar Manic-Euthymic State Recognition Based on Wrist-worn Sensors. (arXiv:2107.00710v2 [cs.LG] UPDATED)
    (2 min) Manic episodes of bipolar disorder can lead to uncritical behaviour and delusional psychosis, often with destructive consequences for those affected and their surroundings. Early detection and intervention of a manic episode are crucial to prevent escalation, hospital admission and premature death. However, people with bipolar disorder may not recognize that they are experiencing a manic episode and symptoms such as euphoria and increased productivity can also deter affected individuals from seeking help. This work proposes to perform user-independent, automatic mood-state detection based on actigraphy and electrodermal activity acquired from a wrist-worn device during mania and after recovery (euthymia). This paper proposes a new deep learning-based ensemble method leveraging long (20h) and short (5 minutes) time-intervals to discriminate between the mood-states. When tested on 47 bipolar patients, the proposed classification scheme achieves an average accuracy of 91.59% in euthymic/manic mood-state recognition.
    Perfect density models cannot guarantee anomaly detection. (arXiv:2012.03808v3 [cs.LG] UPDATED)
    (2 min) Thanks to the tractability of their likelihood, several deep generative models show promise for seemingly straightforward but important applications like anomaly detection, uncertainty estimation, and active learning. However, the likelihood values empirically attributed to anomalies conflict with the expectations these proposed applications suggest. In this paper, we take a closer look at the behavior of distribution densities through the lens of reparametrization and show that these quantities carry less meaningful information than previously thought, beyond estimation issues or the curse of dimensionality. We conclude that the use of these likelihoods for anomaly detection relies on strong and implicit hypotheses, and highlight the necessity of explicitly formulating these assumptions for reliable anomaly detection.
    Neural Latents Benchmark '21: Evaluating latent variable models of neural population activity. (arXiv:2109.04463v4 [cs.LG] UPDATED)
    (2 min) Advances in neural recording present increasing opportunities to study neural activity in unprecedented detail. Latent variable models (LVMs) are promising tools for analyzing this rich activity across diverse neural systems and behaviors, as LVMs do not depend on known relationships between the activity and external experimental variables. However, progress with LVMs for neuronal population activity is currently impeded by a lack of standardization, resulting in methods being developed and compared in an ad hoc manner. To coordinate these modeling efforts, we introduce a benchmark suite for latent variable modeling of neural population activity. We curate four datasets of neural spiking activity from cognitive, sensory, and motor areas to promote models that apply to the wide variety of activity seen across these areas. We identify unsupervised evaluation as a common framework for evaluating models across datasets, and apply several baselines that demonstrate benchmark diversity. We release this benchmark through EvalAI.
    Fractional SDE-Net: Generation of Time Series Data with Long-term Memory. (arXiv:2201.05974v1 [cs.LG])
    (2 min) In this paper, we focus on generation of time-series data using neural networks. It is often the case that input time-series data, especially taken from real financial markets, is irregularly sampled, and its noise structure is more complicated than the i.i.d. type. To generate time series with such properties, we propose fSDE-Net: neural fractional Stochastic Differential Equation Network. It generalizes the neural SDE model by using fractional Brownian motion with Hurst index larger than half, which exhibits the long-term memory property. We derive the solver of fSDE-Net and theoretically analyze the existence and uniqueness of the solution to fSDE-Net. Our experiments demonstrate that the fSDE-Net model can replicate distributional properties well.
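    A small sketch of the model's driving noise: exact sampling of fractional Brownian motion with Hurst index H > 1/2 from its covariance function via Cholesky factorization (practical only for small grids; the grid and H value are illustrative).

        import numpy as np

        def fbm_sample(n_steps=256, H=0.7, T=1.0, seed=0):
            t = np.linspace(T / n_steps, T, n_steps)
            s, u = np.meshgrid(t, t)
            # Cov(B_H(s), B_H(u)) = 0.5 * (s^{2H} + u^{2H} - |s - u|^{2H})
            cov = 0.5 * (s ** (2 * H) + u ** (2 * H) - np.abs(s - u) ** (2 * H))
            L = np.linalg.cholesky(cov + 1e-12 * np.eye(n_steps))
            z = np.random.default_rng(seed).standard_normal(n_steps)
            return t, L @ z

        t, path = fbm_sample(H=0.7)    # H = 0.5 recovers standard Brownian motion
        # Positively correlated increments (long memory) appear for H > 1/2:
        inc = np.diff(path)
        print("lag-1 increment correlation:", np.corrcoef(inc[:-1], inc[1:])[0, 1])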
    A new Sinkhorn algorithm with Deletion and Insertion operations. (arXiv:2111.14565v2 [cs.LG] UPDATED)
    (0 min) This technical report is devoted to the continuous estimation of an epsilon-assignment. Roughly speaking, an epsilon-assignment between two sets V1 and V2 may be understood as a bijective mapping between a subset of V1 and a subset of V2. The remaining elements of V1 (not included in this mapping) are mapped onto an epsilon pseudo-element of V2; we say that such elements are deleted. Conversely, the remaining elements of V2 correspond to the image of the epsilon pseudo-element of V1; we say that these elements are inserted. As a result, our method provides a result similar to that of the Sinkhorn algorithm, with the additional ability to reject some elements, which are either inserted or deleted. It thus naturally handles sets V1 and V2 of different sizes and decides mappings/insertions/deletions in a unified way. Our algorithms are iterative and differentiable and may thus be easily inserted within a backpropagation-based learning framework such as artificial neural networks.
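    A simplified sketch of the idea: run standard Sinkhorn row/column normalization on a cost matrix augmented with an epsilon row and column, so elements that are too expensive to match are routed to deletion or insertion. The costs and marginals below are illustrative, not the report's exact iterations.

        import numpy as np

        def sinkhorn_with_eps(cost, eps_cost=1.0, reg=0.1, n_iter=200):
            n1, n2 = cost.shape
            C = np.full((n1 + 1, n2 + 1), eps_cost)
            C[:n1, :n2] = cost
            C[n1, n2] = np.inf                      # epsilon-to-epsilon is forbidden
            K = np.exp(-C / reg)
            # The epsilon row/column can absorb all unmatched mass.
            r = np.concatenate([np.ones(n1), [n2]])
            c = np.concatenate([np.ones(n2), [n1]])
            u, v = np.ones(n1 + 1), np.ones(n2 + 1)
            for _ in range(n_iter):
                u = r / (K @ v)
                v = c / (K.T @ u)
            return u[:, None] * K * v[None, :]      # soft assignment with del/ins

        cost = np.array([[0.1, 5.0], [5.0, 5.0], [5.0, 0.2]])   # |V1|=3, |V2|=2
        P = sinkhorn_with_eps(cost)
        print(P.round(2))   # the middle element of V1 (all costs high) is deleted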
    Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence. (arXiv:2106.03743v5 [cs.LG] UPDATED)
    (2 min) We investigate the reasons for the performance degradation incurred with batch-independent normalization. We find that the prototypical techniques of layer normalization and instance normalization both induce the appearance of failure modes in the neural network's pre-activations: (i) layer normalization induces a collapse towards channel-wise constant functions; (ii) instance normalization induces a lack of variability in instance statistics, symptomatic of an alteration of the expressivity. To alleviate failure mode (i) without aggravating failure mode (ii), we introduce the technique "Proxy Normalization" that normalizes post-activations using a proxy distribution. When combined with layer normalization or group normalization, this batch-independent normalization emulates batch normalization's behavior and consistently matches or exceeds its performance.
    Machine Learning for Mechanical Ventilation Control. (arXiv:2102.06779v4 [cs.LG] UPDATED)
    (0 min) We consider the problem of controlling an invasive mechanical ventilator for pressure-controlled ventilation: a controller must let air in and out of a sedated patient's lungs according to a trajectory of airway pressures specified by a clinician. Hand-tuned PID controllers and similar variants have comprised the industry standard for decades, yet can behave poorly by over- or under-shooting their target or oscillating rapidly. We consider a data-driven machine learning approach: First, we train a simulator based on data we collect from an artificial lung. Then, we train deep neural network controllers on these simulators. We show that our controllers are able to track target pressure waveforms significantly better than PID controllers. We further show that a learned controller generalizes across lungs with varying characteristics much more readily than PID controllers do.
    Score-based Generative Neural Networks for Large-Scale Optimal Transport. (arXiv:2110.03237v3 [cs.LG] UPDATED)
    (0 min) We consider the fundamental problem of sampling the optimal transport coupling between given source and target distributions. In certain cases, the optimal transport plan takes the form of a one-to-one mapping from the source support to the target support, but learning or even approximating such a map is computationally challenging for large and high-dimensional datasets due to the high cost of linear programming routines and an intrinsic curse of dimensionality. We study instead the Sinkhorn problem, a regularized form of optimal transport whose solutions are couplings between the source and the target distribution. We introduce a novel framework for learning the Sinkhorn coupling between two distributions in the form of a score-based generative model. Conditioned on source data, our procedure iterates Langevin Dynamics to sample target data according to the regularized optimal coupling. Key to this approach is a neural network parametrization of the Sinkhorn problem, and we prove convergence of gradient descent with respect to network parameters in this formulation. We demonstrate its empirical success on a variety of large scale optimal transport tasks.
    Hierarchical VAEs Know What They Don't Know. (arXiv:2102.08248v7 [cs.LG] UPDATED)
    (0 min) Deep generative models have been demonstrated as state-of-the-art density estimators. Yet, recent work has found that they often assign a higher likelihood to data from outside the training distribution. This seemingly paradoxical behavior has caused concerns over the quality of the attained density estimates. In the context of hierarchical variational autoencoders, we provide evidence to explain this behavior by out-of-distribution data having in-distribution low-level features. We argue that this is both expected and desirable behavior. With this insight in hand, we develop a fast, scalable and fully unsupervised likelihood-ratio score for OOD detection that requires data to be in-distribution across all feature-levels. We benchmark the method on a vast set of data and model combinations and achieve state-of-the-art results on out-of-distribution detection.
    ViTBIS: Vision Transformer for Biomedical Image Segmentation. (arXiv:2201.05920v1 [eess.IV])
    (0 min) In this paper, we propose a novel network named Vision Transformer for Biomedical Image Segmentation (ViTBIS). Our network splits the input feature maps into three parts with $1\times 1$, $3\times 3$ and $5\times 5$ convolutions in both encoder and decoder. A concat operator is used to merge the features before they are fed to three consecutive transformer blocks with attention mechanisms embedded inside them. Skip connections are used to connect encoder and decoder transformer blocks. Similarly, transformer blocks and a multi-scale architecture are used in the decoder before the output is linearly projected to produce the segmentation map. We test the performance of our network using the Synapse multi-organ segmentation dataset, the Automated cardiac diagnosis challenge dataset, the Brain tumour MRI segmentation dataset and the Spleen CT segmentation dataset. Without bells and whistles, our network outperforms most of the previous state-of-the-art CNN and transformer based models using the Dice score and the Hausdorff distance as the evaluation metrics.
    BERTMap: A BERT-based Ontology Alignment System. (arXiv:2112.02682v3 [cs.AI] UPDATED)
    (0 min) Ontology alignment (a.k.a. ontology matching, OM) plays a critical role in knowledge integration. Owing to the success of machine learning in many domains, it has been applied in OM. However, the existing methods, which often adopt ad-hoc feature engineering or non-contextual word embeddings, have not yet outperformed rule-based systems, especially in an unsupervised setting. In this paper, we propose a novel OM system named BERTMap which can support both unsupervised and semi-supervised settings. It first predicts mappings using a classifier based on fine-tuning the contextual embedding model BERT on text semantics corpora extracted from ontologies, and then refines the mappings through extension and repair by utilizing the ontology structure and logic. Our evaluation with three alignment tasks on biomedical ontologies demonstrates that BERTMap can often perform better than the leading OM systems LogMap and AML.
    Deep learning via LSTM models for COVID-19 infection forecasting in India. (arXiv:2101.11881v2 [cs.LG] UPDATED)
    (0 min) The COVID-19 pandemic continues to have a major impact on health and medical infrastructure, the economy, and agriculture. Prominent computational and mathematical models have been unreliable due to the complexity of the spread of infections. Moreover, lack of data collection and reporting makes modelling attempts difficult and unreliable. Hence, we need to re-look at the situation with reliable data sources and innovative forecasting models. Deep learning models such as recurrent neural networks are well suited for modelling spatiotemporal sequences. In this paper, we apply recurrent neural networks such as long short-term memory (LSTM), bidirectional LSTM, and encoder-decoder LSTM models for multi-step (short-term) COVID-19 infection forecasting. We select Indian states with COVID-19 hotspots, capture the first (2020) and second (2021) waves of infections, and provide a two-month-ahead forecast. Our model predicts that the likelihood of another wave of infections in October and November 2021 is low; however, the authorities need to be vigilant given emerging variants of the virus. The accuracy of the predictions motivates the application of the method in other countries and regions. Nevertheless, challenges in modelling remain due to the reliability of data and difficulties in capturing factors such as population density, logistics, and social aspects such as culture and lifestyle.
    Towards Communication-Efficient and Privacy-Preserving Federated Representation Learning. (arXiv:2109.14611v2 [cs.LG] UPDATED)
    (2 min) This paper investigates the feasibility of federated representation learning under the constraints of communication cost and privacy protection. Existing works either conduct annotation-guided local training, which requires frequent communication, or aggregate the client models via weight averaging, which carries potential risks of privacy exposure. To tackle these problems, we first identify that self-supervised contrastive local training is robust against non-identically distributed data, which makes longer local training feasible and thus reduces the communication cost. Building on this robustness, we propose a novel Federated representation Learning framework with Ensemble Similarity Distillation (FLESD). At each round of communication, the server first gathers a fraction of the clients' inferred similarity matrices on a public dataset. It then ensembles the similarity matrices and trains the global model via similarity distillation. We verify the effectiveness of FLESD in a series of empirical experiments and show that, despite stricter constraints, it achieves comparable results under multiple settings on multiple datasets.
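    A sketch of the similarity-distillation idea under simple assumptions (row-wise softmax similarities, plain averaging as the ensembling step; FLESD's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def similarity_matrix(feats, tau=0.1):
    """Row-stochastic similarity of L2-normalised features on a public batch."""
    z = F.normalize(feats, dim=1)
    return F.softmax(z @ z.T / tau, dim=1)

def similarity_distillation_loss(global_feats, client_sims, tau=0.1):
    """Server step: ensemble client similarity matrices, distil into the global model."""
    target = torch.stack(client_sims).mean(dim=0)   # ensembling (here: averaging)
    student = similarity_matrix(global_feats, tau)
    return F.kl_div(student.log(), target, reduction="batchmean")
```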
    Training Spiking Neural Networks Using Lessons From Deep Learning. (arXiv:2109.12894v4 [cs.NE] UPDATED)
    (2 min) The brain is the perfect place to look for inspiration to develop more efficient neural networks. The inner workings of our synapses and neurons provide a glimpse at what the future of deep learning might look like. This paper serves as a tutorial and perspective showing how to apply the lessons learnt from several decades of research in deep learning, gradient descent, backpropagation and neuroscience to biologically plausible spiking neural networks. We also explore the delicate interplay between encoding data as spikes and the learning process; the challenges and solutions of applying gradient-based learning to spiking neural networks; the subtle link between temporal backpropagation and spike timing dependent plasticity; and how deep learning might move towards biologically plausible online learning. Some ideas are well accepted and commonly used amongst the neuromorphic engineering community, while others are presented or justified for the first time here. A series of companion interactive tutorials complementary to this paper, using our Python package snnTorch, is also made available: https://snntorch.readthedocs.io/en/latest/tutorials/index.html
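    The core trick for gradient-based training of spiking networks is a surrogate gradient for the non-differentiable spike. A self-contained PyTorch sketch (snnTorch packages this up; the fast-sigmoid surrogate below is one common choice, not the only one):

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike in the forward pass, fast-sigmoid surrogate in the backward."""
    @staticmethod
    def forward(ctx, mem):
        ctx.save_for_backward(mem)
        return (mem > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (mem,) = ctx.saved_tensors
        return grad_output / (1.0 + mem.abs()) ** 2   # d(spike)/d(mem) surrogate

def lif_step(x, mem, beta=0.9, threshold=1.0):
    """One leaky integrate-and-fire step: leak, integrate, fire, reset-by-subtraction."""
    mem = beta * mem + x
    spk = SpikeFn.apply(mem - threshold)
    return spk, mem - spk * threshold
```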
    Robust Aggregation for Federated Learning. (arXiv:1912.13445v2 [stat.ML] UPDATED)
    (2 min) Federated learning is the centralized training of statistical models from decentralized data on mobile devices while preserving the privacy of each device. We present a robust aggregation approach to make federated learning robust in settings where a fraction of the devices may be sending corrupted updates to the server. The approach relies on a robust aggregation oracle based on the geometric median, which returns a robust aggregate using a constant number of iterations of a regular non-robust averaging oracle. The robust aggregation oracle is privacy-preserving, similar to the non-robust secure average oracle it builds upon. We establish its convergence for least squares estimation of additive models. We provide experimental results with linear models and deep networks for three tasks in computer vision and natural language processing. The robust aggregation approach is agnostic to the level of corruption; it outperforms the classical aggregation approach in terms of robustness when the level of corruption is high, while being competitive in the regime of low corruption. Two variants, a faster one with one-step robust aggregation and another with on-device personalization, round off the paper.
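    The robust aggregation oracle can be sketched in a few lines: the (smoothed) Weiszfeld algorithm approximates the geometric median as a short sequence of re-weighted averages, so each iteration can reuse a secure averaging oracle. A NumPy sketch:

```python
import numpy as np

def geometric_median(updates, eps=1e-6, iters=4):
    """Approximate geometric median of client updates (rows of `updates`).

    Each Weiszfeld iteration is just a weighted average, matching the paper's use
    of a constant number of calls to a regular (secure) averaging oracle.
    """
    z = updates.mean(axis=0)                         # start from the plain average
    for _ in range(iters):
        dist = np.linalg.norm(updates - z, axis=1)
        w = 1.0 / np.maximum(dist, eps)              # smoothing avoids division by zero
        z = (w[:, None] * updates).sum(axis=0) / w.sum()
    return z
```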
    Labeling Trick: A Theory of Using Graph Neural Networks for Multi-Node Representation Learning. (arXiv:2010.16103v5 [cs.LG] UPDATED)
    (3 min) In this paper, we provide a theory of using graph neural networks (GNNs) for multi-node representation learning, where we are interested in learning a representation for a set of more than one node, such as a link. GNNs are designed to learn single-node representations. When we want to learn a node set representation involving multiple nodes, a common practice in previous works is to directly aggregate the single-node representations obtained by a GNN into a joint node set representation. In this paper, we show a fundamental constraint of such an approach, namely the inability to capture the dependence between nodes in the node set, and argue that directly aggregating individual node representations does not lead to an effective joint representation for multiple nodes. We then observe that a few previous successful approaches to multi-node representation learning, including SEAL, Distance Encoding, and ID-GNN, all use node labeling. These methods first label nodes in the graph according to their relationships with the target node set before applying a GNN. The node representations obtained in the labeled graph are then aggregated into a node set representation. By investigating their inner mechanisms, we unify these node labeling techniques into a single, most general form -- the labeling trick. We prove that with the labeling trick a sufficiently expressive GNN learns the most expressive node set representations, and thus in principle solves any joint learning task over node sets. Experiments on one important two-node representation learning task, link prediction, verify our theory. Our work explains the superior performance of previous node-labeling-based methods and establishes a theoretical foundation for using GNNs for multi-node representation learning.
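    The simplest instance of the labeling trick is "zero-one" labeling: mark the target node set before running the GNN, so structurally symmetric nodes are distinguished by their role. A hedged sketch; the gnn and mlp components in the usage comment are hypothetical placeholders:

```python
import numpy as np

def zero_one_labels(num_nodes, target_set):
    """Append a binary label marking membership in the target node set."""
    labels = np.zeros((num_nodes, 1), dtype=np.float32)
    labels[list(target_set)] = 1.0
    return labels

# Link prediction for candidate edge (u, v), with hypothetical gnn/mlp components:
# x_labeled = np.concatenate([x, zero_one_labels(x.shape[0], {u, v})], axis=1)
# h = gnn(x_labeled, edge_index)   # representations now depend on the target set
# score = mlp(h[u] * h[v])         # aggregate only after labeling
```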
    Algorithmic encoding of protected characteristics in image-based models for disease detection. (arXiv:2110.14755v3 [cs.LG] UPDATED)
    (2 min) It has been rightfully emphasized that the use of AI for clinical decision making could amplify health disparities. A machine learning model may pick up undesirable correlations, for example, between a patient's racial identity and clinical outcome. Such correlations are often present in (historical) data used for model development. There has been an increase in studies reporting biases in image-based disease detection models. Besides the scarcity of data from underserved populations, very little is known about how these biases are encoded and how one may reduce or even remove disparate performance. There are concerns that an algorithm may recognize patient characteristics such as biological sex or racial identity, and then directly or indirectly use this information when making predictions. But it remains unclear how we can establish whether such information is actually used. This article aims to shed some light on these issues by exploring different methodologies for assessing the inner workings of disease detection models. We explore multitask learning and model inspection to assess the relationship between protected characteristics and the prediction of disease. We believe our analysis framework could provide valuable insights for future studies in medical imaging AI. Our findings also call for further research to better understand the underlying causes of performance disparities.
    Regularising Inverse Problems with Generative Machine Learning Models. (arXiv:2107.11191v2 [eess.IV] UPDATED)
    (2 min) Deep neural network approaches to inverse imaging problems have produced impressive results in the last few years. In this paper, we consider the use of generative models in a variational regularisation approach to inverse problems. The considered regularisers penalise images that are far from the range of a generative model that has learned to produce images similar to a training dataset. We name this family 'generative regularisers'. The success of generative regularisers depends on the quality of the generative model and so we propose a set of desired criteria to assess models and guide future research. In our numerical experiments, we evaluate three common generative models, autoencoders, variational autoencoders and generative adversarial networks, against our desired criteria. We also test three different generative regularisers on the inverse problems of deblurring, deconvolution, and tomography. We show that the success of solutions restricted to lie exactly in the range of the generator is highly dependent on the ability of the generative model but that allowing small deviations from the range of the generator produces more consistent results.
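    One way to write a generative regulariser down concretely: penalise the distance of the reconstruction from the range of a generator G, with a weight lam controlling how far solutions may deviate from that range. A gradient-descent sketch assuming a differentiable forward operator A and generator G (names and optimiser are our assumptions):

```python
import torch

def solve_inverse(y, A, G, lam=1.0, z_dim=64, steps=500, lr=1e-2):
    """min over (x, z) of  ||A(x) - y||^2 + lam * ||x - G(z)||^2.

    lam -> infinity pins x to the generator's range; a finite lam is the
    'small deviations' variant the abstract reports to be more consistent.
    """
    z = torch.zeros(1, z_dim, requires_grad=True)
    x = torch.zeros_like(G(z)).requires_grad_(True)
    opt = torch.optim.Adam([x, z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((A(x) - y) ** 2).sum() + lam * ((x - G(z)) ** 2).sum()
        loss.backward()
        opt.step()
    return x.detach()
```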
    A Sparse Model-inspired Deep Thresholding Network for Exponential Signal Reconstruction -- Application in Fast Biological Spectroscopy. (arXiv:2012.14830v5 [cs.LG] UPDATED)
    (2 min) Non-uniform sampling is a powerful approach to enable fast acquisition but requires sophisticated reconstruction algorithms. Faithful reconstruction from partially sampled exponentials is highly desirable in general signal processing and many applications. Deep learning has shown astonishing potential in this field, but many remaining problems, such as a lack of robustness and explainability, greatly limit its applications. In this work, by combining the merits of the sparse model-based optimization method and data-driven deep learning, we propose a deep learning architecture for spectra reconstruction from undersampled data, called MoDern. It builds the neural network by following the iterative reconstruction of a sparse model, and we elaborately design a learnable soft-thresholding operation to adaptively eliminate the spectrum artifacts introduced by undersampling. Extensive results on both synthetic and biological data show that MoDern enables more robust, high-fidelity, and ultra-fast reconstruction than state-of-the-art methods. Remarkably, MoDern has a small number of network parameters and is trained solely on synthetic data while generalizing well to biological data in various scenarios. Furthermore, we extend it to an open-access and easy-to-use cloud computing platform (XCloud-MoDern), contributing a promising strategy for further development of biological applications.
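    The learnable soft-thresholding at the heart of such unrolled sparse solvers fits in a few lines; the softplus parameterisation keeping the threshold positive is our choice, not necessarily the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableSoftThreshold(nn.Module):
    """Soft-thresholding with a trainable threshold, one per unrolled iteration."""
    def __init__(self, init=0.1):
        super().__init__()
        self.raw = nn.Parameter(torch.tensor(float(init)))

    def forward(self, x):
        t = F.softplus(self.raw)                       # keep the threshold positive
        return torch.sign(x) * torch.clamp(x.abs() - t, min=0.0)
```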
    A three layer neural network can represent any multivariate function. (arXiv:2012.03016v2 [cs.LG] UPDATED)
    (2 min) In 1987, Hecht-Nielsen showed that any continuous multivariate function can be implemented by a certain type of three-layer neural network. This result was much discussed in the neural network literature. In this paper we prove that not only continuous functions but also all discontinuous functions can be implemented by such neural networks.
    On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds. (arXiv:2111.08759v2 [cs.DC] UPDATED)
    (2 min) With the growing amount of data, data processing workloads and the management of their resource usage become increasingly important. Since managing a dedicated infrastructure is in many situations infeasible or uneconomical, users progressively execute their respective workloads in the cloud. As the configuration of workloads and resources is often challenging, various methods have been proposed that either quickly profile towards a good configuration or determine one based on data from previous runs. Still, performance data to train such methods is often lacking and must be collected at significant cost. In this paper, we propose a collaborative approach for sharing anonymized workload execution traces among users, mining them for general patterns, and exploiting clusters of historical workloads for future optimizations. We evaluate our prototype implementation for mining workload execution graphs on a publicly available trace dataset and demonstrate the predictive value of workload clusters determined through traces only.
    Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning. (arXiv:2010.03950v2 [cs.LG] UPDATED)
    (2 min) Reinforcement learning (RL) methods usually treat reward functions as black boxes. As such, these methods must extensively interact with the environment in order to discover rewards and optimal policies. In most RL applications, however, users have to program the reward function and, hence, there is the opportunity to make the reward function visible -- to show the reward function's code to the RL agent so it can exploit the function's internal structure to learn optimal policies in a more sample-efficient manner. In this paper, we show how to accomplish this idea in two steps. First, we propose reward machines, a type of finite state machine that supports the specification of reward functions while exposing reward function structure. We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning. Experiments on tabular and continuous domains, across different tasks and RL agents, show the benefits of exploiting reward structure with respect to sample efficiency and the quality of resultant policies. Finally, by virtue of being a form of finite state machine, reward machines have the expressive power of a regular language and as such support loops, sequences and conditionals, as well as the expression of temporally extended properties typical of linear temporal logic and non-Markovian reward specification.
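    A reward machine is easy to state in code: a finite state machine over high-level events whose transitions emit rewards. A toy Python sketch with a hypothetical key-then-goal task:

```python
class RewardMachine:
    """transitions[(state, event)] = (next_state, reward); unknown events self-loop."""
    def __init__(self, transitions, u0=0):
        self.transitions, self.u = transitions, u0

    def step(self, event):
        self.u, reward = self.transitions.get((self.u, event), (self.u, 0.0))
        return reward

rm = RewardMachine({
    (0, "key"):  (1, 0.0),   # picking up the key advances the machine
    (1, "goal"): (0, 1.0),   # reaching the goal afterwards pays reward 1
})
```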
    Cooperative Multi-Agent Deep Reinforcement Learning for Reliable Surveillance via Autonomous Multi-UAV Control. (arXiv:2201.05843v1 [eess.SY])
    (2 min) CCTV-based surveillance using unmanned aerial vehicles (UAVs) is considered a key technology for security in smart city environments. This paper considers a scenario in which UAVs with CCTV cameras fly over a city area to provide flexible and reliable surveillance services. The UAVs should be deployed to cover a large area while minimizing overlapping and shadow areas for a reliable surveillance system. However, the operation of UAVs is subject to high uncertainty, necessitating autonomous recovery systems. This work develops a multi-agent deep reinforcement learning-based management scheme for reliable industrial surveillance in smart city applications. The core idea is to autonomously replenish a UAV's deficient network requirements through communication. In intensive simulations, our proposed algorithm outperforms state-of-the-art algorithms in terms of surveillance coverage, user support capability, and computational cost.
    Dynamic Differential-Privacy Preserving SGD. (arXiv:2111.00173v3 [cs.LG] UPDATED)
    (2 min) Vanilla Differentially-Private Stochastic Gradient Descent (DP-SGD), including DP-Adam and other variants, ensures the privacy of training data by uniformly distributing privacy costs across training steps. Keeping privacy costs equal in each step by maintaining the same gradient clipping thresholds and noise powers results in unstable updates and lower model accuracy compared to the non-DP counterpart. In this paper, we propose dynamic DP-SGD (along with dynamic DP-Adam and other variants) to reduce the performance loss gap while maintaining privacy, by dynamically adjusting clipping thresholds and noise powers while adhering to a total privacy budget constraint. Extensive experiments on a variety of deep learning tasks, including image classification, natural language processing, and federated learning, demonstrate that the proposed dynamic DP-SGD algorithm stabilizes updates and, as a result, significantly improves model accuracy in the strong privacy protection regime when compared to vanilla DP-SGD. We also conduct theoretical analysis to better understand the privacy-utility trade-off with dynamic DP-SGD, and to learn why dynamic DP-SGD can outperform vanilla DP-SGD.
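    A sketch of one dynamic DP-SGD step on a single parameter tensor; the geometric decay schedules are illustrative assumptions (the paper derives its own schedules, and the privacy accountant tracking the total budget is not shown):

```python
import torch

def dynamic_dp_sgd_step(param, per_sample_grads, step, total_steps,
                        clip0=1.0, sigma0=1.0, lr=0.1, decay=0.5):
    """Per-sample clipping and Gaussian noise whose scales shrink during training."""
    frac = step / total_steps
    clip = clip0 * decay ** frac        # dynamic clipping threshold
    sigma = sigma0 * decay ** frac      # dynamic noise multiplier
    flat = per_sample_grads.flatten(1)                       # (batch, n_params)
    norms = flat.norm(dim=1).clamp(min=1e-12)
    clipped = flat * (clip / norms).clamp(max=1.0).unsqueeze(1)
    noise = torch.randn_like(flat[0]) * sigma * clip / flat.shape[0]
    param.data -= lr * (clipped.mean(dim=0) + noise).view_as(param)
```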
    Uniform Sampling over Episode Difficulty. (arXiv:2108.01662v2 [cs.LG] UPDATED)
    (2 min) Episodic training is a core ingredient of few-shot learning to train models on tasks with limited labelled data. Despite its success, episodic training remains largely understudied, prompting us to ask the question: what is the best way to sample episodes? In this paper, we first propose a method to approximate episode sampling distributions based on their difficulty. Building on this method, we perform an extensive analysis and find that sampling uniformly over episode difficulty outperforms other sampling schemes, including curriculum and easy-/hard-mining. As the proposed sampling method is algorithm agnostic, we can leverage these insights to improve few-shot learning accuracies across many episodic training algorithms. We demonstrate the efficacy of our method across popular few-shot learning datasets, algorithms, network architectures, and protocols.
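    One simple way to realise "uniform over difficulty" (our approximation, not necessarily the paper's estimator): bin episodes by difficulty with equal-width bins, pick a bin uniformly, then pick an episode within it:

```python
import numpy as np

def sample_uniform_over_difficulty(difficulties, n_bins=10, rng=None):
    """Return one episode index, approximately uniform over the difficulty axis."""
    rng = rng or np.random.default_rng()
    d = np.asarray(difficulties, dtype=float)
    edges = np.linspace(d.min(), d.max(), n_bins + 1)
    bins = np.clip(np.digitize(d, edges[1:-1]), 0, n_bins - 1)
    while True:
        members = np.flatnonzero(bins == rng.integers(n_bins))
        if members.size:                  # empty bins are simply re-drawn
            return int(rng.choice(members))
```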
    Perspective Transformation Layer. (arXiv:2201.05706v1 [cs.CV])
    (2 min) Incorporating geometric transformations that reflect relative position changes between an observer and an object into computer vision and deep learning models has attracted much attention in recent years. However, existing proposals mainly focus on affine transformations, which cannot fully capture viewpoint changes. Furthermore, current solutions often apply a neural network module to learn a single transformation matrix, which ignores the possibility of multiple viewpoints and introduces extra trainable module parameters. In this paper, a perspective transformation layer (PT layer) is proposed to learn perspective transformations that not only model the geometries of affine transformation but also reflect viewpoint changes. In addition, because it can be trained directly with gradient descent like traditional layers such as convolutional layers, a single PT layer can learn an adjustable number of viewpoints without extra trainable module parameters. The experiments and evaluations confirm the superiority of the proposed PT layer.
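    A hedged sketch of how such a layer can be built with standard autograd machinery: V learnable 3x3 homographies warp the sampling grid, and grid_sample produces one view per matrix. Details such as initialisation and the channel-stacking output fusion are our assumptions, not the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerspectiveLayer(nn.Module):
    """Apply V learnable perspective transforms; stack the warped views on channels."""
    def __init__(self, n_views=4):
        super().__init__()
        eye = torch.eye(3).repeat(n_views, 1, 1)
        self.H = nn.Parameter(eye + 0.01 * torch.randn(n_views, 3, 3))

    def forward(self, x):
        B, C, Hh, Ww = x.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, Hh, device=x.device),
                                torch.linspace(-1, 1, Ww, device=x.device),
                                indexing="ij")
        grid = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)
        outs = []
        for Hm in self.H:                               # one warp per viewpoint
            w = grid @ Hm.T
            w = w[:, :2] / w[:, 2:].clamp(min=1e-6)     # perspective division
            g = w.reshape(1, Hh, Ww, 2).expand(B, -1, -1, -1)
            outs.append(F.grid_sample(x, g, align_corners=True))
        return torch.cat(outs, dim=1)
```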
    Deciding Not To Decide. (arXiv:2201.05818v1 [cs.AI])
    (2 min) Sometimes unexpected, novel, inconceivable events enter our lives. The cause-effect mappings that usually guide our behaviour are destroyed. Surprised and shocked by possibilities that we had never imagined, we are unable to make any decision beyond mere routine. Among these are decisions, such as making investments, that are essential for the long-term survival of businesses as well as the economy at large. We submit that the standard machinery of utility maximization does not apply, but we propose measures inspired by scenario planning and graph analysis, pointing to solutions being explored in machine learning.
    Profitable Strategy Design by Using Deep Reinforcement Learning for Trades on Cryptocurrency Markets. (arXiv:2201.05906v1 [q-fin.TR])
    (2 min) Deep reinforcement learning solutions have been applied to different control problems with promising results. In this research work, we apply Proximal Policy Optimization, Soft Actor-Critic and Generative Adversarial Imitation Learning to the strategy design problem for three cryptocurrency markets. Our input data include price data and technical indicators. We implement a Gym environment based on cryptocurrency markets to be used with the algorithms. Our test results on unseen data show great potential for this approach in helping investors, through an expert system, to exploit the market and gain profit. Our highest gain for an unseen 66-day span is 4850 US dollars per 10000 US dollars of investment. We also discuss how a specific hyperparameter in the environment design can be used to adjust risk in the generated strategies.
    A flow-based IDS using Machine Learning in eBPF. (arXiv:2102.09980v2 [cs.CR] UPDATED)
    (2 min) eBPF is a new technology which allows dynamically loading pieces of code into the Linux kernel. It can greatly speed up networking since it enables the kernel to process certain packets without the involvement of a userspace program. So far, eBPF has been used for simple packet filtering applications such as firewalls or denial-of-service protection. We show that it is possible to develop a flow-based network intrusion detection system based on machine learning entirely in eBPF. Our solution uses a decision tree and decides for each packet whether it is malicious or not, considering the entire previous context of the network flow. We achieve a performance increase of over 20% compared to the same solution implemented as a userspace program.
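    Since eBPF programs are restricted C, a small decision tree can be compiled into nested branches the verifier accepts. A hypothetical Python sketch of such an exporter; the feature names, struct, and function signature are placeholders, not the paper's code:

```python
from sklearn.tree import DecisionTreeClassifier  # used in the commented example below

def tree_to_c(clf, feature_names):
    """Emit a fitted sklearn decision tree as nested C if/else statements."""
    t = clf.tree_
    def rec(node, depth):
        pad = "    " * depth
        if t.children_left[node] == -1:                      # leaf node
            return f"{pad}return {int(t.value[node].argmax())};\n"
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        return (f"{pad}if ({name} <= {thr:.6f}) {{\n"
                + rec(t.children_left[node], depth + 1)
                + f"{pad}}} else {{\n"
                + rec(t.children_right[node], depth + 1) + f"{pad}}}\n")
    return "int classify_flow(struct flow *f) {\n" + rec(0, 1) + "}\n"

# clf = DecisionTreeClassifier(max_depth=4).fit(X_flows, y)   # hypothetical flow data
# print(tree_to_c(clf, ["f->duration", "f->n_pkts", "f->mean_iat", "f->bytes"]))
```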
    Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. (arXiv:2201.05989v1 [cs.CV])
    (2 min) Neural graphics primitives, parameterized by fully connected neural networks, can be costly to train and evaluate. We reduce this cost with a versatile new input encoding that permits the use of a smaller network without sacrificing quality, thus significantly reducing the number of floating point and memory access operations: a small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through stochastic gradient descent. The multiresolution structure allows the network to disambiguate hash collisions, making for a simple architecture that is trivial to parallelize on modern GPUs. We leverage this parallelism by implementing the whole system using fully-fused CUDA kernels with a focus on minimizing wasted bandwidth and compute operations. We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds, and rendering in tens of milliseconds at a resolution of ${1920\!\times\!1080}$.
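    The encoding itself is compact enough to sketch. A simplified 2D, NumPy-only version using the common spatial-hash constants (real implementations run per level on the GPU with trainable tables and optimize them via SGD):

```python
import numpy as np

PRIMES = np.array([1, 2654435761], dtype=np.uint64)   # common spatial-hash constants

def hash_encode(xy, tables, base_res=16, growth=1.5):
    """Multiresolution hash encoding of one point xy in [0, 1]^2.

    Per level: hash the 4 corners of the enclosing grid cell into that level's
    feature table, bilinearly interpolate, then concatenate across levels.
    """
    feats = []
    for lvl, table in enumerate(tables):               # table: (T, F) trainable array
        res = int(base_res * growth ** lvl)
        pos = np.asarray(xy) * res
        cell, frac = np.floor(pos).astype(np.uint64), pos - np.floor(pos)
        acc = np.zeros(table.shape[1])
        for dx in (0, 1):
            for dy in (0, 1):
                corner = cell + np.array([dx, dy], dtype=np.uint64)
                h = int(np.bitwise_xor.reduce(corner * PRIMES) % table.shape[0])
                w = (frac[0] if dx else 1 - frac[0]) * (frac[1] if dy else 1 - frac[1])
                acc += w * table[h]                    # bilinear blend of corner features
        feats.append(acc)
    return np.concatenate(feats)
```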
    Universal Online Learning: an Optimistically Universal Learning Rule. (arXiv:2201.05947v1 [cs.LG])
    (2 min) We study the subject of universal online learning with non-i.i.d. processes for bounded losses. The notion of universally consistent learning was defined by Hanneke in an effort to study learning theory under minimal assumptions, where the objective is to obtain a low long-run average loss for any target function. We are interested in characterizing processes for which learning is possible and in whether there exist learning rules guaranteed to be universally consistent given only the assumption that such learning is possible. The case of unbounded losses is very restrictive, since the learnable processes almost surely visit a finite number of points and, as a result, simple memorization is optimistically universal. We focus on the bounded setting and give a complete characterization of the processes admitting strong and weak universal learning. We further show that the k-nearest neighbor algorithm (kNN) is not optimistically universal, and present a novel variant of 1NN which is optimistically universal for general input and value spaces in both the strong and weak settings. This closes all the COLT 2021 open problems posed by Hanneke on universal online learning.
    VAC-CNN: A Visual Analytics System for Comparative Studies of Deep Convolutional Neural Networks. (arXiv:2110.13252v2 [cs.LG] UPDATED)
    (2 min) The rapid development of Convolutional Neural Networks (CNNs) in recent years has triggered significant breakthroughs in many machine learning (ML) applications. The ability to understand and compare the various CNN models available is thus essential. The conventional approach of visualizing each model's quantitative features, such as classification accuracy and computational complexity, is not sufficient for a deeper understanding and comparison of the behaviors of different models. Moreover, most existing tools for assessing CNN behaviors only support comparison between two models and lack the flexibility of customizing the analysis tasks according to user needs. This paper presents a visual analytics system, VAC-CNN (Visual Analytics for Comparing CNNs), that supports the in-depth inspection of a single CNN model as well as comparative studies of two or more models. The ability to compare a larger number of (e.g., tens of) models especially distinguishes our system from previous ones. With carefully designed model visualization and explanation support, VAC-CNN facilitates a highly interactive workflow that promptly presents both quantitative and qualitative information at each analysis stage. We demonstrate VAC-CNN's effectiveness in assisting novice ML practitioners in evaluating and comparing multiple CNN models through two use cases and one preliminary evaluation study using image classification tasks on the ImageNet dataset.
    Active Learning for Human-in-the-Loop Customs Inspection. (arXiv:2010.14282v2 [cs.LG] UPDATED)
    (2 min) We study the human-in-the-loop customs inspection scenario, where an AI-assisted algorithm supports customs officers by recommending a set of imported goods to be inspected. If the inspected items are fraudulent, the officers can levy extra duties. The resulting inspection logs are then used as additional training data for successive iterations. Choosing to inspect suspicious items first leads to an immediate gain in customs revenue, yet such inspections may not bring new insights for learning dynamic traffic patterns. On the other hand, inspecting uncertain items can help acquire new knowledge, which will be used as a supplementary training resource to update the selection systems. Based on multi-year customs datasets obtained from three countries, we demonstrate that some degree of exploration is necessary to cope with domain shifts in trade data. The results show that a hybrid strategy of selecting likely fraudulent and uncertain items will eventually outperform the exploitation-only strategy.
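    The hybrid strategy reduces to a simple budget split between exploitation and exploration; a sketch under the assumption that the model outputs a fraud score and an uncertainty per item:

```python
import numpy as np

def select_for_inspection(fraud_score, uncertainty, k, explore_frac=0.2):
    """Spend most of the budget k on likely-fraudulent items, the rest on uncertain ones."""
    n_explore = int(k * explore_frac)
    chosen = list(np.argsort(-fraud_score)[: k - n_explore])   # exploitation
    for i in np.argsort(-uncertainty):                          # exploration
        if len(chosen) == k:
            break
        if i not in chosen:
            chosen.append(i)
    return np.array(chosen)
```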
    Generating synthetic transactional profiles. (arXiv:2111.01531v2 [cs.LG] UPDATED)
    (2 min) Financial institutions use clients' payment transactions in numerous banking applications. Transactions are very personal and rich in behavioural patterns, often unique to individuals, which makes them equivalent to personally identifiable information in some cases. In this paper, we generate synthetic transactional profiles using machine learning techniques with the goal of preserving both data utility and privacy. A challenge we faced was dealing with sparse vectors due to the few spending categories a client uses compared to all the ones available. We measured data utility by calculating common insights used by the banking industry on both the original and the synthetic dataset. Our approach shows that neural network models can generate valuable synthetic data in such a context. Finally, we applied privacy-preserving techniques and observed their effect on the models' performance.
    First steps on Gamification of Lung Fluid Cells Annotations in the Flower Domain. (arXiv:2111.03663v2 [eess.IV] UPDATED)
    (2 min) Annotating data, especially in the medical domain, requires expert knowledge and a lot of effort. This limits the amount and/or usefulness of available medical datasets for experimentation. Therefore, developing strategies to increase the number of annotations while lowering the required domain knowledge is of interest. A possible strategy is the use of gamification, i.e. transforming the annotation task into a game. We propose an approach to gamify the task of annotating lung fluid cells from pathological whole slide images (WSIs). As the domain is unknown to non-expert annotators, we transform images of cells into the domain of flower images using a CycleGAN architecture. In this more accessible domain, non-expert annotators can be (t)asked to annotate different kinds of flowers in a playful setting. As a proof of concept, this work shows that the domain transfer is possible by evaluating an image classification network trained on real cell images and tested on the cell images generated by the CycleGAN network (reconstructed cell images) as well as on real cell images. The classification network reaches an average accuracy of 94.73% on the original lung fluid cells and 95.25% on the transformed lung fluid cells. Our study lays the foundation for future research on gamification using CycleGANs.
    CLUE: Contextualised Unified Explainable Learning of User Engagement in Video Lectures. (arXiv:2201.05651v1 [cs.AI])
    (2 min) Predicting contextualised engagement in videos is a long-standing problem that has been popularly attempted by exploiting the number of views or the associated likes using different computational methods. The recent decade has seen a boom in online learning resources, and during the pandemic, there has been an exponential rise of online teaching videos without much quality control. The quality of the content could be improved if creators could get constructive feedback on their content. Employing an army of domain expert volunteers to provide feedback on the videos might not scale. As a result, there has been a steep rise in developing computational methods to predict a user engagement score that is indicative of some form of possible user engagement, i.e., to what level a user would tend to engage with the content. A drawback of current methods is that they model various features separately, in a cascaded approach, which is prone to error propagation. Besides, most of them do not provide crucial explanations of how the creator could improve their content. In this paper, we propose a new unified model, CLUE, for the educational domain, which learns from features extracted from freely available public online teaching videos and provides explainable feedback on a video along with a user engagement score. Given the complexity of the task, our unified framework employs different pre-trained models working together as an ensemble of classifiers. Our model exploits various multi-modal features to model the complexity of language, context-agnostic information, the textual emotion of the delivered content, animation, the speaker's pitch, and speech emotions. Under a transfer learning setup, the overall model, in the unified space, is fine-tuned for downstream applications.
    Identifying optimal cycles in quantum thermal machines with reinforcement-learning. (arXiv:2108.13525v2 [quant-ph] UPDATED)
    (2 min) The optimal control of open quantum systems is a challenging task but has a key role in improving existing quantum information processing technologies. We introduce a general framework based on reinforcement learning to discover optimal thermodynamic cycles that maximize the power of out-of-equilibrium quantum heat engines and refrigerators. We apply our method, based on the soft actor-critic algorithm, to three systems: a benchmark two-level-system heat engine, where we find the optimal known cycle; an experimentally realistic refrigerator based on a superconducting qubit that generates coherence, where we find a non-intuitive control sequence that outperforms previous cycles proposed in the literature; and a heat engine based on a quantum harmonic oscillator, where we find a cycle with an elaborate structure that outperforms the optimized Otto cycle. We then evaluate the corresponding efficiency at maximum power.
    Sequence Parallelism: Long Sequence Training from System Perspective. (arXiv:2105.13120v2 [cs.LG] UPDATED)
    (2 min) Self-attention suffers from quadratic memory requirements with respect to the sequence length. In this work, we propose sequence parallelism, a memory-efficient parallelism method that helps break the input sequence length limitation and train with longer sequences on GPUs efficiently. Our approach is compatible with most existing parallelisms. More importantly, we no longer require a single device to hold the whole sequence. Specifically, we split the input sequence into multiple chunks and feed each chunk into its corresponding device (i.e., GPU). To compute the attention output, we integrate ring-style communication with the self-attention calculation and propose Ring Self-Attention (RSA). Experiments show that sequence parallelism performs well when scaling with batch size and sequence length. Compared with tensor parallelism, our approach achieved $13.7\times$ and $3.0\times$ larger maximum batch size and sequence length, respectively, when scaling up to 64 NVIDIA P100 GPUs.
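    The ring pattern is easy to see in a single-process simulation, with chunks standing in for per-GPU shards (real RSA overlaps communication with computation; this sketch only mirrors the data flow):

```python
import torch

def ring_self_attention(q, k, v, n_dev):
    """Simulate RSA: each 'device' keeps its Q chunk and sees K/V chunks in ring order."""
    qs, ks, vs = q.chunk(n_dev), k.chunk(n_dev), v.chunk(n_dev)
    scale = q.shape[1] ** 0.5
    outs = []
    for i in range(n_dev):
        order = [(i + s) % n_dev for s in range(n_dev)]          # ring arrival order
        scores = torch.cat([qs[i] @ ks[j].T for j in order], dim=1)
        attn = torch.softmax(scores / scale, dim=1)
        outs.append(attn @ torch.cat([vs[j] for j in order], dim=0))
    return torch.cat(outs, dim=0)
```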
    Improving Generalization in Meta-RL with Imaginary Tasks from Latent Dynamics Mixture. (arXiv:2105.13524v2 [cs.LG] UPDATED)
    (0 min) The generalization ability of most meta-reinforcement learning (meta-RL) methods is largely limited to test tasks that are sampled from the same distribution used to sample training tasks. To overcome the limitation, we propose Latent Dynamics Mixture (LDM) that trains a reinforcement learning agent with imaginary tasks generated from mixtures of learned latent dynamics. By training a policy on mixture tasks along with original training tasks, LDM allows the agent to prepare for unseen test tasks during training and prevents the agent from overfitting the training tasks. LDM significantly outperforms standard meta-RL methods in test returns on the gridworld navigation and MuJoCo tasks where we strictly separate the training task distribution and the test task distribution.
    Ensembling Off-the-shelf Models for GAN Training. (arXiv:2112.09130v2 [cs.CV] UPDATED)
    (0 min) The advent of large-scale training has produced a cornucopia of powerful visual recognition models. However, generative models, such as GANs, have traditionally been trained from scratch in an unsupervised manner. Can the collective "knowledge" from a large bank of pretrained vision models be leveraged to improve GAN training? If so, with so many models to choose from, which one(s) should be selected, and in what manner are they most effective? We find that pretrained computer vision models can significantly improve performance when used in an ensemble of discriminators. Notably, the particular subset of selected models greatly affects performance. We propose an effective selection mechanism, by probing the linear separability between real and fake samples in pretrained model embeddings, choosing the most accurate model, and progressively adding it to the discriminator ensemble. Interestingly, our method can improve GAN training in both limited data and large-scale settings. Given only 10k training samples, our FID on LSUN Cat matches the StyleGAN2 trained on 1.6M images. On the full dataset, our method improves FID by 1.5x to 2x on cat, church, and horse categories of LSUN.
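    The selection mechanism can be sketched directly: probe each pretrained embedding with a linear classifier on real-vs-fake and rank models by probe accuracy. The model names and embedding functions below are placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def rank_by_linear_separability(embed_fns, real_imgs, fake_imgs):
    """Return model names sorted by how well a linear probe separates real from fake."""
    y = np.r_[np.ones(len(real_imgs)), np.zeros(len(fake_imgs))]
    acc = {}
    for name, embed in embed_fns.items():        # name -> batch-embedding function
        X = np.vstack([embed(real_imgs), embed(fake_imgs)])
        acc[name] = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3).mean()
    return sorted(acc, key=acc.get, reverse=True)
```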
    Risk-Monotonicity in Statistical Learning. (arXiv:2011.14126v5 [cs.LG] UPDATED)
    (0 min) Acquisition of data is a difficult task in many applications of machine learning, and it is only natural that one hopes and expects the population risk to decrease (better performance) monotonically with increasing data points. It turns out, somewhat surprisingly, that this is not the case even for the most standard algorithms that minimize the empirical risk. Non-monotonic behavior of the risk and instability in training have appeared in the popular deep learning paradigm under the name of double descent. These problems highlight the current lack of understanding of learning algorithms and generalization. It is, therefore, crucial to pursue this concern and provide a characterization of such behavior. In this paper, we derive the first consistent and risk-monotonic (in high probability) algorithms for a general statistical learning setting under weak assumptions, consequently answering some of the questions posed by Viering et al. (2019) on how to avoid non-monotonic behavior of risk curves. We further show that risk monotonicity need not necessarily come at the price of worse excess risk rates. To achieve this, we derive new empirical Bernstein-like concentration inequalities of independent interest that hold for certain non-i.i.d. processes such as martingale difference sequences.
    Learning Dynamical Systems with Side Information. (arXiv:2008.10135v2 [math.OC] UPDATED)
    (2 min) We present a mathematical and computational framework for the problem of learning a dynamical system from noisy observations of a few trajectories and subject to side information. Side information is any knowledge we might have about the dynamical system we would like to learn besides trajectory data. It is typically inferred from domain-specific knowledge or basic principles of a scientific discipline. We are interested in explicitly integrating side information into the learning process in order to compensate for scarcity of trajectory observations. We identify six types of side information that arise naturally in many applications and lead to convex constraints in the learning problem. First, we show that when our model for the unknown dynamical system is parameterized as a polynomial, one can impose our side information constraints computationally via semidefinite programming. We then demonstrate the added value of side information for learning the dynamics of basic models in physics and cell biology, as well as for learning and controlling the dynamics of a model in epidemiology. Finally, we study how well polynomial dynamical systems can approximate continuously-differentiable ones while satisfying side information (either exactly or approximately). Our overall learning methodology combines ideas from convex optimization, real algebra, dynamical systems, and functional approximation theory, and can potentially lead to new synergies between these areas.
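    To give the flavour of the approach in its simplest convex instance: fit polynomial dynamics by least squares while constraining the vector field to respect side information. The paper works with semidefinite programming and general polynomial bases; the tiny monomial basis and known-equilibrium constraint below are our illustrative assumptions (cvxpy assumed installed):

```python
import cvxpy as cp
import numpy as np

def fit_dynamics_with_equilibria(X, Xdot, equilibria):
    """Fit x' = A @ phi(x) by least squares, forcing the field to vanish at equilibria."""
    phi = lambda x: np.array([x[0], x[1], x[0] * x[1]])     # illustrative basis
    Phi = np.array([phi(x) for x in X])                     # (n_samples, 3)
    A = cp.Variable((2, 3))
    constraints = [A @ phi(e) == 0 for e in equilibria]     # side information
    cp.Problem(cp.Minimize(cp.sum_squares(Phi @ A.T - Xdot)), constraints).solve()
    return A.value
```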
    CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks. (arXiv:2201.05729v1 [cs.CV])
    (0 min) Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified embedding space, yielding tremendous potential for vision-language (VL) tasks. While early concurrent works have begun to study this potential on a subset of tasks, important questions remain: 1) What is the benefit of CLIP on unstudied VL tasks? 2) Does CLIP provide benefits in low-shot or domain-shifted scenarios? 3) Can CLIP improve existing approaches without impacting inference or pretraining complexity? In this work, we seek to answer these questions through two key contributions. First, we introduce an evaluation protocol that includes Visual Commonsense Reasoning (VCR), Visual Entailment (SNLI-VE), and Visual Question Answering (VQA), across a variety of data availability constraints and conditions of domain shift. Second, we propose an approach, named CLIP Targeted Distillation (CLIP-TD), to intelligently distill knowledge from CLIP into existing architectures using a dynamically weighted objective applied to adaptively selected tokens per instance. Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models pretrained with image-text data only. On SNLI-VE, CLIP-TD produces significant gains in low-shot conditions (up to 6.6%) as well as fully supervised ones (up to 3%). On VQA, CLIP-TD provides improvements in low-shot (up to 9%) and fully-supervised (up to 1.3%) conditions. Finally, CLIP-TD outperforms concurrent works utilizing CLIP for finetuning, as well as baseline naive distillation approaches. Code will be made available.
    Quantum Ensemble for Classification. (arXiv:2007.01028v3 [cs.LG] UPDATED)
    (0 min) A powerful way to improve performance in machine learning is to construct an ensemble that combines the predictions of multiple models. Ensemble methods are often much more accurate and have lower variance than the individual classifiers that compose them, but have high requirements in terms of memory and computational time. In fact, a large number of alternative algorithms is usually adopted, each requiring queries over all available data. We propose a new quantum algorithm that exploits quantum superposition, entanglement and interference to build an ensemble of classification models. Thanks to the generation of several quantum trajectories in superposition, we obtain $B$ transformations of the quantum state which encodes the training set in only $\log(B)$ operations. This implies exponential growth of the ensemble size while the depth of the corresponding circuit grows only linearly. Furthermore, when considering the overall cost of the algorithm, we show that the training of a single weak classifier impacts the overall time complexity additively rather than multiplicatively, as usually happens in classical ensemble methods. We also present small-scale experiments on real-world datasets, defining a quantum version of the cosine classifier and using the IBM Qiskit environment to show how the algorithms work.
    Recursive Bayesian Networks: Generalising and Unifying Probabilistic Context-Free Grammars and Dynamic Bayesian Networks. (arXiv:2111.01853v4 [cs.LG] UPDATED)
    (0 min) Probabilistic context-free grammars (PCFGs) and dynamic Bayesian networks (DBNs) are widely used sequence models with complementary strengths and limitations. While PCFGs allow for nested hierarchical dependencies (tree structures), their latent variables (non-terminal symbols) have to be discrete. In contrast, DBNs allow for continuous latent variables, but the dependencies are strictly sequential (chain structure). Therefore, neither can be applied if the latent variables are assumed to be continuous and also to have a nested hierarchical dependency structure. In this paper, we present Recursive Bayesian Networks (RBNs), which generalise and unify PCFGs and DBNs, combining their strengths and containing both as special cases. RBNs define a joint distribution over tree-structured Bayesian networks with discrete or continuous latent variables. The main challenge lies in performing joint inference over the exponential number of possible structures and the continuous variables. We provide two solutions: 1) For arbitrary RBNs, we generalise inside and outside probabilities from PCFGs to the mixed discrete-continuous case, which allows for maximum posterior estimates of the continuous latent variables via gradient descent, while marginalising over network structures. 2) For Gaussian RBNs, we additionally derive an analytic approximation, allowing for robust parameter optimisation and Bayesian inference. The capacity and diverse applications of RBNs are illustrated on two examples: In a quantitative evaluation on synthetic data, we demonstrate and discuss the advantage of RBNs for segmentation and tree induction from noisy sequences, compared to change point detection and hierarchical clustering. In an application to musical data, we approach the unsolved problem of hierarchical music analysis from the raw note level and compare our results to expert annotations.
    Embodied Self-supervised Learning by Coordinated Sampling and Training. (arXiv:2006.13350v2 [eess.AS] UPDATED)
    (0 min) Self-supervised learning can significantly improve the performance of downstream tasks; however, the dimensions of learned representations normally lack explicit physical meaning. In this work, we propose a novel self-supervised approach that solves inverse problems by employing the corresponding physical forward process, so that the learned representations have explicit physical meaning. The proposed approach works in an analysis-by-synthesis manner to learn an inference network by iteratively sampling and training. At the sampling step, given observed data, the inference network is used to approximate the intractable posterior, from which we sample input parameters and feed them to a physical process to generate data in the observational space; at the training step, the same network is optimized with the sampled paired data. We demonstrate the feasibility of the proposed method by tackling the acoustic-to-articulatory inversion problem, inferring articulatory information from speech. Given an articulatory synthesizer, an inference model can be trained completely from scratch with random initialization. Our experiments demonstrate that the proposed method converges steadily and the network learns to control the articulatory synthesizer to speak like a human. We also demonstrate that trained models generalize well to unseen speakers or even new languages, and that performance can be further improved through self-adaptation.
    SimCVD: Simple Contrastive Voxel-Wise Representation Distillation for Semi-Supervised Medical Image Segmentation. (arXiv:2108.06227v3 [cs.CV] UPDATED)
    (0 min) Automated segmentation in medical image analysis is a challenging task that requires a large amount of manually labeled data. However, most existing learning-based approaches usually suffer from limited manually annotated medical data, which poses a major practical problem for accurate and robust medical image segmentation. In addition, most existing semi-supervised approaches are usually not robust compared with their supervised counterparts, and also lack explicit modeling of geometric structure and semantic information, both of which limit segmentation accuracy. In this work, we present SimCVD, a simple contrastive distillation framework that significantly advances state-of-the-art voxel-wise representation learning. We first describe an unsupervised training strategy, which takes two views of an input volume and predicts their signed distance maps of object boundaries in a contrastive objective, with only two independent dropouts as masks. This simple approach works surprisingly well, performing on the same level as previous fully supervised methods with much less labeled data. We hypothesize that dropout can be viewed as a minimal form of data augmentation that makes the network robust to representation collapse. We then propose to perform structural distillation by distilling pair-wise similarities. We evaluate SimCVD on two popular datasets: the Left Atrial Segmentation Challenge (LA) and the NIH pancreas CT dataset. The results on the LA dataset demonstrate that, at two labeled ratios (i.e., 20% and 10%), SimCVD achieves average Dice scores of 90.85% and 89.03% respectively, a 0.91% and 2.22% improvement over previous best results. Our method can be trained in an end-to-end fashion, showing promise for use as a general framework for downstream tasks such as medical image synthesis, enhancement, and registration.
    Explaining and Measuring Functionalities of Malware Detectors. (arXiv:2111.10085v2 [cs.CR] UPDATED)
    (0 min) Numerous open-source and commercial malware detectors are available. However, their efficacy is threatened by new adversarial attacks, whereby malware attempts to evade detection, e.g., by performing feature-space manipulation. In this work, we propose an explainability-guided and model-agnostic framework for measuring the ability of malware to evade detection. The framework introduces the concept of Accrued Malicious Magnitude (AMM) to identify which malware features should be manipulated to maximize the likelihood of evading detection. We then use this framework to test several state-of-the-art malware detectors' ability to detect manipulated malware. We find that (i) commercial antivirus engines are vulnerable to AMM-guided manipulated samples; (ii) the ability of malware manipulated using one detector to evade detection by another detector (i.e., transferability) depends on the overlap of features with large AMM values between the different detectors; and (iii) AMM values effectively measure the importance of features and explain the ability to evade detection. Our findings shed light on the weaknesses of current malware detectors, as well as how they can be improved.
    ResNEsts and DenseNEsts: Block-based DNN Models with Improved Representation Guarantees. (arXiv:2111.05496v2 [cs.LG] UPDATED)
    (0 min) The models recently used in the literature to prove that residual networks (ResNets) are better than linear predictors are actually different from the standard ResNets widely used in computer vision. In addition to assumptions such as scalar-valued output or a single residual block, these models have no nonlinearities at the final residual representation that feeds into the final affine layer. To codify such a difference in nonlinearities and reveal a linear estimation property, we define ResNEsts, i.e., Residual Nonlinear Estimators, by simply dropping the nonlinearities at the last residual representation of standard ResNets. We show that wide ResNEsts with bottleneck blocks can always guarantee a very desirable training property that standard ResNets aim to achieve, i.e., adding more blocks does not decrease performance given the same set of basis elements. To prove this, we first recognize that ResNEsts are basis function models limited by a coupling problem between basis learning and linear prediction. Then, to decouple prediction weights from basis learning, we construct a special architecture termed augmented ResNEst (A-ResNEst) that always guarantees no worse performance with the addition of a block. As a result, such an A-ResNEst establishes empirical risk lower bounds for a ResNEst using the corresponding bases. Our results demonstrate that ResNEsts indeed have a problem of diminishing feature reuse; however, it can be avoided by sufficiently expanding or widening the input space, leading to the above-mentioned desirable property. Inspired by DenseNets, which have been shown to outperform ResNets, we also propose a corresponding new model called the Densely connected Nonlinear Estimator (DenseNEst). We show that any DenseNEst can be represented as a wide ResNEst with bottleneck blocks. Unlike ResNEsts, DenseNEsts exhibit the desirable property without any special architectural re-design.
    Common Phone: A Multilingual Dataset for Robust Acoustic Modelling. (arXiv:2201.05912v1 [eess.AS])
    (0 min) Current state-of-the-art acoustic models can easily comprise more than 100 million parameters. This growing complexity demands larger training datasets to maintain a decent generalization of the final decision function. An ideal dataset is not necessarily large in size, but large with respect to the number of unique speakers, utilized hardware and varying recording conditions. This enables a machine learning model to explore as much of the domain-specific input space as possible during parameter estimation. This work introduces Common Phone, a gender-balanced, multilingual corpus recorded from more than 76,000 contributors via Mozilla's Common Voice project. It comprises around 116 hours of speech enriched with automatically generated phonetic segmentation. A Wav2Vec 2.0 acoustic model was trained on Common Phone to perform phonetic symbol recognition and validate the quality of the generated phonetic annotation. The architecture achieved a PER of 18.1% on the entire test set, computed over all 101 unique phonetic symbols, with slight differences between the individual languages. We conclude that Common Phone provides sufficient variability and reliable phonetic annotation to help bridge the gap between research and application of acoustic models.
    Fast and Accurate Optical Fiber Channel Modeling Using Generative Adversarial Network. (arXiv:2002.12648v2 [cs.IT] UPDATED)
    (0 min) In this work, a new data-driven fiber channel modeling method based on a generative adversarial network (GAN) is investigated to learn the distribution of the fiber channel transfer function. Our investigation focuses on the joint channel effects of attenuation, chromatic dispersion, self-phase modulation (SPM), and amplified spontaneous emission (ASE) noise. To make the GAN succeed at channel modeling, we modify the loss function, design the conditioning vector of the input, and address mode collapse for long-haul transmission. The effective architecture, parameters, and training techniques of the GAN are also presented in the paper. The results show that the proposed method can learn an accurate transfer function of the fiber channel. The modeled transmission distance can be up to 1000 km and can theoretically be extended to arbitrary distances. Moreover, the GAN shows robust generalization ability under different optical launch powers, modulation formats, and input signal distributions. Comparing the complexity of the GAN with the split-step Fourier method (SSFM), the total number of multiplications is only 2% of SSFM's, and the running time is less than 0.1 seconds for 1000-km transmission, versus 400 seconds using the SSFM under the same hardware and software conditions, which highlights the remarkable reduction in the complexity of fiber channel modeling.
    Self-supervised Representation Learning of Neuronal Morphologies. (arXiv:2112.12482v2 [stat.ML] UPDATED)
    (0 min) Understanding the diversity of cell types and their function in the brain is one of the key challenges in neuroscience. The advent of large-scale datasets has given rise to the need of unbiased and quantitative approaches to cell type classification. We present GraphDINO, a purely data-driven approach to learning a low dimensional representation of the 3D morphology of neurons. GraphDINO is a novel graph representation learning method for spatial graphs utilizing self-supervised learning on transformer models. It combines attention-based global interaction between nodes and classic graph convolutional processing. We show, in two different species and cortical areas, that this method is able to yield morphological cell type clustering that is comparable to manual feature-based classification and shows a good correspondence to expert-labeled cell types. Our method is applicable beyond neuroscience in settings where samples in a dataset are graphs and graph-level embeddings are desired.
    Optimizing Area Under the Curve Measures via Matrix Factorization for Predicting Drug-Target Interaction with Multiple Similarities. (arXiv:2105.01545v2 [q-bio.QM] UPDATED)
    (0 min) In drug discovery, identifying drug-target interactions (DTIs) via experimental approaches is a tedious and expensive procedure. Computational methods efficiently predict DTIs and recommend a small set of potential interacting pairs for further experimental confirmation, accelerating the drug discovery process. Although it has been shown that fusing heterogeneous drug and target similarities can improve the prediction ability, existing similarity combination methods ignore the interaction consistency of neighbouring entities, which is crucial for the DTI prediction model. Furthermore, the area under the precision-recall curve (AUPR), which emphasizes the accuracy of top-ranked pairs, and the area under the receiver operating characteristic curve (AUC), which heavily penalizes the presence of low-ranked interacting pairs, are two widely used evaluation metrics in DTI prediction. However, the two metrics are seldom used as losses within existing DTI prediction methods. This paper first proposes two matrix factorization (MF) methods that optimize AUPR and AUC using convex surrogate losses, respectively, and then develops an ensemble MF approach that takes advantage of both area-under-the-curve metrics by combining the two single-metric MF models. All three proposed approaches incorporate a novel local-interaction-consistency-aware similarity integration method to generate fused drug and target similarities that preserve vital information from the more reliable views. Experimental results over five datasets under different prediction settings show that the proposed methods outperform various competitors in terms of the metric(s) they optimize. In addition, validation on the top-ranked novel predictions confirms the ability of our methods to discover potential new DTIs.
    A Law of Iterated Logarithm for Multi-Agent Reinforcement Learning. (arXiv:2110.15092v3 [cs.LG] UPDATED)
    (0 min) In Multi-Agent Reinforcement Learning (MARL), multiple agents interact with a common environment, as well as with each other, to solve a shared problem in sequential decision-making. It has wide-ranging applications in gaming, robotics, finance, etc. In this work, we derive a novel law of the iterated logarithm for a family of distributed nonlinear stochastic approximation schemes that is useful in MARL. In particular, our result describes the convergence rate on almost every sample path where the algorithm converges. This result is the first of its kind in the distributed setup and provides deeper insights than existing ones, which only discuss convergence rates in the expected or the CLT sense. Importantly, our result holds under significantly weaker assumptions: neither does the gossip matrix need to be doubly stochastic nor the stepsizes square summable. As an application, we show that, for the stepsize $n^{-\gamma}$ with $\gamma \in (0, 1)$, the distributed TD(0) algorithm with linear function approximation has a convergence rate of $O(\sqrt{n^{-\gamma} \ln n})$ a.s.; for the $1/n$ type stepsize, the same is $O(\sqrt{n^{-1} \ln \ln n})$ a.s. These decay rates do not depend on the graph depicting the interactions among the different agents.
    Unifying Heterogeneous Electronic Health Records Systems via Text-Based Code Embedding. (arXiv:2111.09098v4 [cs.CL] UPDATED)
    (0 min) EHR systems lack a unified code system for representing medical concepts, which acts as a barrier for the large-scale deployment of deep learning models to multiple clinics and hospitals. To overcome this problem, we introduce Description-based Embedding (DescEmb), a code-agnostic representation learning framework for EHR. DescEmb takes advantage of the flexibility of neural language understanding models to embed clinical events using their textual descriptions rather than directly mapping each event to a dedicated embedding. DescEmb outperformed traditional code-based embedding in extensive experiments, especially in a zero-shot transfer task (one hospital to another), and was able to train a single unified model for heterogeneous EHR datasets.
    Disentanglement enables cross-domain Hippocampus Segmentation. (arXiv:2201.05650v1 [eess.IV])
    (0 min) A limited amount of labelled training data is a common problem in medical imaging. This makes it difficult to train a well-generalised model and therefore often leads to failure in unknown domains. Hippocampus segmentation from magnetic resonance imaging (MRI) scans is critical for the diagnosis and treatment of neuropsychiatric disorders. Domain differences in contrast or shape can significantly affect segmentation. We address this issue by disentangling a T1-weighted MRI image into its content and domain. This separation enables us to perform a domain transfer and thus convert data from new sources into the training domain. This step thus simplifies the segmentation problem, resulting in higher-quality segmentations. We achieve the disentanglement with the proposed novel methodology 'Content Domain Disentanglement GAN', and we propose to retrain the UNet on the transformed outputs to deal with GAN-specific artefacts. With these changes, we are able to improve performance on unseen domains by 6-13% and outperform state-of-the-art domain transfer methods.
    Stochastic Multi-Armed Bandits with Control Variates. (arXiv:2105.03962v3 [cs.LG] UPDATED)
    (0 min) This paper studies a new variant of the stochastic multi-armed bandits problem where auxiliary information about the arm rewards is available in the form of control variates. In many applications like queuing and wireless networks, the arm rewards are functions of some exogenous variables. The mean values of these variables are known a priori from historical data and can be used as control variates. Leveraging the theory of control variates, we obtain mean estimates with smaller variance and tighter confidence bounds. We develop an upper confidence bound based algorithm named UCB-CV and characterize the regret bounds in terms of the correlation between rewards and control variates when they follow a multivariate normal distribution. We also extend UCB-CV to other distributions using resampling methods like Jackknifing and Splitting. Experiments on synthetic problem instances validate the performance guarantees of the proposed algorithms.
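    To make the control-variate idea concrete: with rewards $X_i$, control variates $W_i$, and known mean $E[W]$, the variance-reduced estimate is $\bar{X} - \hat{\beta}(\bar{W} - E[W])$ with $\hat{\beta} = \widehat{\mathrm{Cov}}(X, W)/\widehat{\mathrm{Var}}(W)$. The sketch below is a minimal illustration; the two-armed toy problem and the generic UCB1-style bonus are our assumptions, not the paper's tighter CV-based confidence bounds.

```python
import numpy as np

rng = np.random.default_rng(0)

def cv_estimate(x, w, w_mean):
    # Variance-reduced mean: beta = Cov(X, W) / Var(W), estimated from data.
    beta = np.cov(x, w, ddof=1)[0, 1] / np.var(w, ddof=1)
    return x.mean() - beta * (w.mean() - w_mean)

# Toy 2-armed bandit: rewards correlate with an exogenous variable W
# whose mean (0 here) is known a priori from historical data.
arm_means, w_mean = np.array([0.4, 0.5]), 0.0
xs, ws = [[], []], [[], []]
for t in range(2000):
    counts = [len(xs[0]), len(xs[1])]
    if min(counts) < 2:                          # round-robin initial pulls
        a = int(np.argmin(counts))
    else:
        ucb = [cv_estimate(np.array(xs[a]), np.array(ws[a]), w_mean)
               + np.sqrt(2 * np.log(t + 1) / counts[a])  # generic UCB1 bonus
               for a in range(2)]
        a = int(np.argmax(ucb))
    w = rng.normal()
    xs[a].append(arm_means[a] + 0.8 * w + 0.1 * rng.normal())
    ws[a].append(w)
```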
    Posterior Adaptation With New Priors. (arXiv:2007.01386v3 [cs.LG] UPDATED)
    (0 min) Classification approaches based on the direct estimation and analysis of posterior probabilities will degrade if the original class priors begin to change. We prove that the data likelihoods for a test example can be recovered, uniquely up to scale, from its original class posteriors and dataset priors. Given the recovered likelihoods and a set of new priors, the posteriors can be re-computed using Bayes' Rule to reflect the influence of the new priors. The method is simple to compute and allows a dynamic update of the original posteriors.
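    Concretely, since $p(y|x) \propto p(x|y)\,p(y)$, dividing the posterior by the training prior recovers each class likelihood up to a common scale that cancels after renormalisation. A minimal numpy sketch of this update (the function name and toy numbers are ours):

```python
import numpy as np

def adapt_posterior(posterior, train_priors, new_priors):
    likelihood = posterior / train_priors   # p(x|y), recovered up to scale
    new_post = likelihood * new_priors      # Bayes' rule with the new priors
    return new_post / new_post.sum()        # the unknown scale cancels here

# A classifier trained with priors (0.9, 0.1) outputs (0.6, 0.4); under
# balanced priors the same evidence points strongly to class 1.
print(adapt_posterior(np.array([0.6, 0.4]),
                      np.array([0.9, 0.1]),
                      np.array([0.5, 0.5])))   # -> [0.143, 0.857]
```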
    Scientific Machine Learning through Physics-Informed Neural Networks: Where we are and What's next. (arXiv:2201.05624v1 [cs.LG])
    (0 min) Physics-Informed Neural Networks (PINNs) are neural networks (NNs) that encode model equations, such as Partial Differential Equations (PDEs), as a component of the neural network itself. PINNs are nowadays used to solve PDEs, fractional equations, and integro-differential equations. This novel methodology has arisen as a multi-task learning framework in which an NN must fit observed data while reducing a PDE residual. This article provides a comprehensive review of the literature on PINNs: while the primary goal of the study is to characterize these networks and their related advantages and disadvantages, the review also attempts to incorporate publications on a larger variety of issues, including physics-constrained neural networks (PCNNs), where the initial or boundary conditions are directly embedded in the NN structure rather than in the loss function. The study indicates that most research has focused on customizing the PINN through different activation functions, gradient optimization techniques, neural network structures, and loss function structures. Despite the wide range of applications for which PINNs have been used, and their demonstrated feasibility in some contexts relative to classical numerical techniques like the Finite Element Method (FEM), advancements are still possible, most notably on theoretical issues that remain unresolved.
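    As a minimal illustration of the multi-task loss this review describes (boundary fit plus PDE residual), the sketch below trains a tiny PINN in PyTorch on the toy problem $u'(x) + u(x) = 0$, $u(0) = 1$; the architecture, collocation sampler, and equal loss weighting are illustrative assumptions:

```python
import torch

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(5000):
    x = 2.0 * torch.rand(64, 1)          # collocation points in [0, 2]
    x.requires_grad_(True)
    u = net(x)
    du = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    residual = (du + u).pow(2).mean()              # ODE residual u' + u = 0
    boundary = (net(torch.zeros(1, 1)) - 1.0).pow(2).mean()   # u(0) = 1
    loss = residual + boundary                     # multi-task PINN loss
    opt.zero_grad(); loss.backward(); opt.step()
# net now approximates u(x) = exp(-x) on [0, 2].
```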
    Block Policy Mirror Descent. (arXiv:2201.05756v1 [cs.LG])
    (0 min) In this paper, we present a new class of policy gradient (PG) methods, namely the block policy mirror descent (BPMD) methods for solving a class of regularized reinforcement learning (RL) problems with (strongly) convex regularizers. Compared to the traditional PG methods with batch update rule, which visit and update the policy for every state, BPMD methods have cheap per-iteration computation via a partial update rule that performs the policy update on a sampled state. Despite the nonconvex nature of the problem and a partial update rule, BPMD methods achieve fast linear convergence to the global optimality. We further extend BPMD methods to the stochastic setting, by utilizing stochastic first-order information constructed from samples. We establish $\mathcal{O}(1/\epsilon)$ (resp. $\mathcal{O}(1/\epsilon^2)$) sample complexity for the strongly convex (resp. non-strongly convex) regularizers, with different procedures for constructing the stochastic first-order information, where $\epsilon$ denotes the target accuracy. To the best of our knowledge, this is the first time that block coordinate descent methods have been developed and analyzed for policy optimization in reinforcement learning.
    Automated causal inference in application to randomized controlled clinical trials. (arXiv:2201.05773v1 [stat.ME])
    (0 min) Randomized controlled trials (RCTs) are considered the gold standard for testing causal hypotheses in the clinical domain. However, the investigation of prognostic variables of patient outcome in a hypothesized cause-effect route is not feasible using standard statistical methods. Here, we propose a new automated causal inference method (AutoCI) built upon the invariant causal prediction (ICP) framework for the causal re-interpretation of clinical trial data. Compared to existing methods, we show that the proposed AutoCI allows one to efficiently determine the causal variables with a clear differentiation on two real-world RCTs of endometrial cancer patients with mature outcome and extensive clinicopathological and molecular data. This is achieved by suppressing the causal probability of non-causal variables by a wide margin. In ablation studies, we further demonstrate that the assignment of causal probabilities by AutoCI remains consistent in the presence of confounders. In conclusion, these results confirm the robustness and feasibility of AutoCI for future applications in real-world clinical analysis.
    Noncommutative Geometry of Computational Models and Uniformization for Framed Quiver Varieties. (arXiv:2201.05900v1 [math.AG])
    (0 min) We formulate a mathematical setup for computational neural networks using noncommutative algebras and near-rings, motivated by quantum automata. We study the moduli space of the corresponding framed quiver representations, and find moduli of Euclidean and non-compact types in light of uniformization.
    Effective and Efficient Graph Learning for Multi-view Clustering. (arXiv:2108.06734v2 [cs.LG] UPDATED)
    (0 min) Despite impressive clustering performance and efficiency in characterizing both the relationships among data and the cluster structure, existing graph-based multi-view clustering methods still have the following drawbacks. They suffer from an expensive time burden due to both the construction of graphs and the eigen-decomposition of the Laplacian matrix, and they fail to explore the cluster structure of large-scale data. Moreover, they require post-processing to obtain the final clustering, resulting in suboptimal performance. Furthermore, the rank of the learned view-consensus graph cannot approximate the target rank. In this paper, drawing inspiration from the bipartite graph, we propose an effective and efficient graph learning model for multi-view clustering. Specifically, our method exploits the similarity between graphs of different views by minimizing the tensor Schatten p-norm, which well characterizes both the spatial structure and the complementary information embedded in graphs of different views. We learn the view-consensus graph with an adaptively weighted strategy and a connectivity constraint such that the connected components directly indicate clusters. Our proposed algorithm is time-economical, obtains stable results, and scales well with the data size. Extensive experimental results indicate that our method is superior to state-of-the-art methods.
    Prediction of Drug-Induced TdP Risks Using Machine Learning and Rabbit Ventricular Wedge Assay. (arXiv:2201.05669v1 [q-bio.QM])
    (0 min) The evaluation of drug-induced Torsades de pointes (TdP) risks is crucial in drug safety assessment. In this study, we discuss machine learning approaches in the prediction of drug-induced TdP risks using preclinical data. Specifically, the random forest model was trained on the dataset generated by the rabbit ventricular wedge assay. The model prediction performance was measured on 28 drugs from the Comprehensive In Vitro Proarrhythmia Assay initiative. Leave-one-drug-out cross-validation provided an unbiased estimation of model performance. Stratified bootstrap revealed the uncertainty in the asymptotic model prediction. Our study validated the utility of machine learning approaches in predicting drug-induced TdP risks from preclinical data. Our methods can be extended to other preclinical protocols and serve as a supplementary evaluation in drug safety assessment.
    Unifying Heterogeneous Electronic Health Records Systems via Text-Based Code Embedding. (arXiv:2108.03625v2 [cs.LG] UPDATED)
    (0 min) A substantial increase in the use of Electronic Health Records (EHRs) has opened new frontiers for predictive healthcare. However, while EHR systems are nearly ubiquitous, they lack a unified code system for representing medical concepts. Heterogeneous formats of EHR present a substantial barrier for the training and deployment of state-of-the-art deep learning models at scale. To overcome this problem, we introduce Description-based Embedding (DescEmb), a code-agnostic description-based representation learning framework for predictive modeling on EHR. DescEmb takes advantage of the flexibility of neural language understanding models while maintaining a neutral approach that can be combined with prior frameworks for task-specific representation learning or predictive modeling. We tested our model's capacity in various experiments, including prediction tasks, transfer learning, and pooled learning. DescEmb shows higher overall performance than code-based approaches, opening the door to a text-based approach in predictive healthcare research that is constrained neither by EHR structure nor by special domain knowledge.
    CAFE: Catastrophic Data Leakage in Vertical Federated Learning. (arXiv:2110.15122v3 [cs.LG] UPDATED)
    (0 min) Recent studies show that private training data can be leaked through the gradient-sharing mechanism deployed in distributed machine learning systems, such as federated learning (FL). Increasing the batch size to complicate data recovery is often viewed as a promising defense strategy against data leakage. In this paper, we revisit this defense premise and propose an advanced data leakage attack with theoretical justification to efficiently recover batch data from the shared aggregated gradients. We name our proposed method catastrophic data leakage in vertical federated learning (CAFE). Compared to existing data leakage attacks, our extensive experimental results on vertical FL settings demonstrate the effectiveness of CAFE in performing large-batch data leakage attacks with improved data recovery quality. We also propose a practical countermeasure to mitigate CAFE. Our results suggest that private data participating in standard FL, especially in the vertical case, are at high risk of being leaked from the training gradients. Our analysis implies unprecedented and practical data leakage risks in those learning settings. The code of our work is available at https://github.com/DeRafael/CAFE.
    On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay. (arXiv:2106.15739v3 [cs.LG] UPDATED)
    (2 min) Training neural networks with batch normalization and weight decay has become a common practice in recent years. In this work, we show that their combined use may result in a surprising periodic behavior of optimization dynamics: the training process regularly exhibits destabilizations that, however, do not lead to complete divergence but cause a new period of training. We rigorously investigate the mechanism underlying the discovered periodic behavior from both empirical and theoretical points of view and analyze the conditions in which it occurs in practice. We also demonstrate that periodic behavior can be regarded as a generalization of two previously opposing perspectives on training with batch normalization and weight decay, namely the equilibrium presumption and the instability presumption.
    StolenEncoder: Stealing Pre-trained Encoders. (arXiv:2201.05889v1 [cs.CR])
    (2 min) Pre-trained encoders are general-purpose feature extractors that can be used for many downstream tasks. Recent progress in self-supervised learning can pre-train highly effective encoders using a large volume of unlabeled data, leading to the emerging encoder-as-a-service (EaaS) paradigm. A pre-trained encoder may be deemed confidential because its training often requires a lot of data and computational resources, and because its public release may facilitate misuse of AI, e.g., for deepfake generation. In this paper, we propose the first attack, called StolenEncoder, to steal pre-trained image encoders. We evaluate StolenEncoder on multiple target encoders pre-trained by ourselves and three real-world target encoders, including the ImageNet encoder pre-trained by Google, the CLIP encoder pre-trained by OpenAI, and Clarifai's General Embedding encoder deployed as a paid EaaS. Our results show that the encoders stolen by StolenEncoder have similar functionality to the target encoders. In particular, the downstream classifiers built upon a target encoder and a stolen encoder have similar accuracy. Moreover, stealing a target encoder using StolenEncoder requires much less data and computation than pre-training it from scratch. We also explore three defenses that perturb the feature vectors produced by a target encoder. Our evaluation shows that these defenses are not enough to mitigate StolenEncoder.
    Learn Quasi-stationary Distributions of Finite State Markov Chain. (arXiv:2111.11213v2 [cs.LG] UPDATED)
    (2 min) We propose a reinforcement learning (RL) approach to computing quasi-stationary distributions. Based on the fixed-point formulation of the quasi-stationary distribution, we minimize the KL-divergence between the two Markovian path distributions induced by the candidate distribution and the true target distribution. To solve this challenging minimization problem by gradient descent, we apply reinforcement learning techniques by introducing reward and value functions. We derive the corresponding policy gradient theorem and design an actor-critic algorithm to learn the optimal solution and the value function. Numerical examples on finite-state Markov chains demonstrate the new method.
    A Framework for Pedestrian Sub-classification and Arrival Time Prediction at Signalized Intersection Using Preprocessed Lidar Data. (arXiv:2201.05877v1 [cs.LG])
    (2 min) The mortality rate for pedestrians using wheelchairs is 36% higher than the overall pedestrian mortality rate. However, there are no data clarifying the categories of pedestrians involved in fatal and nonfatal accidents, since police reports often do not record whether a victim was using a wheelchair or has a disability. Real-time detection of vulnerable road users using advanced traffic sensors installed on the infrastructure side has great potential to significantly improve traffic safety at intersections. In this research, we develop a systematic framework combining machine learning and deep learning models to distinguish pedestrians with disabilities from other pedestrians and to predict the time needed to reach the other side of the intersection. The proposed framework shows high performance in both vulnerable-user classification and arrival time prediction accuracy.
    Exposing the Obscured Influence of State-Controlled Media: A Causal Estimation of Influence Between Media Outlets Via Quotation Propagation. (arXiv:2201.05985v1 [cs.SI])
    (2 min) This study quantifies influence between media outlets by applying a novel methodology that uses causal effect estimation on networks and transformer language models. We demonstrate the obscured influence of state-controlled outlets over other outlets, regardless of orientation, by analyzing a large dataset of quotations from over 100 thousand articles published by the most prominent European and Russian traditional media outlets, appearing between May 2018 and October 2019. The analysis maps out the network structure of influence with news wire services serving as prominent bridges that connect outlets in different geo-political spheres. Overall, this approach demonstrates capabilities to identify and quantify the channels of influence in intermedia agenda setting over specific topics.
    Learning Multi-Objective Curricula for Deep Reinforcement Learning. (arXiv:2110.03032v2 [cs.LG] UPDATED)
    (2 min) Various automatic curriculum learning (ACL) methods have been proposed to improve the sample efficiency and final performance of deep reinforcement learning (DRL). They are designed to control how a DRL agent collects data, which is inspired by how humans gradually adapt their learning processes to their capabilities. For example, ACL can be used for subgoal generation, reward shaping, environment generation, or initial state generation. However, prior work only considers curriculum learning following one of the aforementioned predefined paradigms. It is unclear which of these paradigms are complementary, and how the combination of them can be learned from interactions with the environment. Therefore, in this paper, we propose a unified automatic curriculum learning framework to create multi-objective but coherent curricula that are generated by a set of parametric curriculum modules. Each curriculum module is instantiated as a neural network and is responsible for generating a particular curriculum. In order to coordinate those potentially conflicting modules in a unified parameter space, we propose a multi-task hyper-net learning framework that uses a single hyper-net to parameterize all those curriculum modules. In addition to existing hand-designed curriculum paradigms, we further design a flexible memory mechanism to learn an abstract curriculum, which may otherwise be difficult to design manually. We evaluate our method on a series of robotic manipulation tasks and demonstrate its superiority over other state-of-the-art ACL methods in terms of sample efficiency and final performance.
    An adaptive stochastic gradient-free approach for high-dimensional blackbox optimization. (arXiv:2006.10887v2 [math.OC] UPDATED)
    (2 min) In this work, we propose a novel adaptive stochastic gradient-free (ASGF) approach for solving high-dimensional nonconvex optimization problems based on function evaluations. We employ a directional Gaussian smoothing of the target function that generates a surrogate of the gradient and assists in avoiding bad local optima by utilizing nonlocal information about the loss landscape. Applying a deterministic quadrature scheme results in a massively scalable technique that is sample-efficient and achieves spectral accuracy. At each step we randomly generate the search directions while primarily following the surrogate of the smoothed gradient. This enables exploitation of the gradient direction while maintaining sufficient space exploration, and accelerates convergence towards the global extrema. In addition, we make use of a local approximation of the Lipschitz constant in order to adaptively adjust the values of all hyperparameters, thus removing the careful fine-tuning that current algorithms often require to be successful on a large class of learning tasks. As such, the ASGF strategy offers significant improvements for solving high-dimensional nonconvex optimization problems compared to other gradient-free methods (including the so-called "evolutionary strategies") as well as iterative approaches that rely on gradient information of the objective function. We illustrate the improved performance of this method by providing several comparative numerical studies on benchmark global optimization problems and reinforcement learning tasks.
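    For intuition, the sketch below implements the generic antithetic Monte Carlo estimator of the Gaussian-smoothed gradient, $\nabla f_\sigma(x) \approx E[(f(x+\sigma u) - f(x-\sigma u))\,u]/(2\sigma)$ with $u \sim N(0, I)$. ASGF instead applies a deterministic quadrature along random directions and adapts its hyperparameters, so this is a simplified stand-in, not the paper's scheme:

```python
import numpy as np

def smoothed_grad(f, x, sigma=0.1, n_samples=64, seed=None):
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n_samples, x.size))
    # Antithetic pairs f(x + s*u) - f(x - s*u) reduce estimator variance.
    fd = np.array([f(x + sigma * ui) - f(x - sigma * ui) for ui in u])
    return (fd[:, None] * u).mean(axis=0) / (2.0 * sigma)

# Gradient-free descent on a nonconvex test function.
f = lambda z: np.sum(z ** 2) + 0.5 * np.sin(5 * z).sum()
x = np.ones(10)
for _ in range(200):
    x -= 0.05 * smoothed_grad(f, x)
```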
    Towards Biologically Plausible Convolutional Networks. (arXiv:2106.13031v2 [cs.LG] UPDATED)
    (2 min) Convolutional networks are ubiquitous in deep learning. They are particularly useful for images, as they reduce the number of parameters, reduce training time, and increase accuracy. However, as a model of the brain they are seriously problematic, since they require weight sharing - something real neurons simply cannot do. Consequently, while neurons in the brain can be locally connected (one of the features of convolutional networks), they cannot be convolutional. Locally connected but non-convolutional networks, however, significantly underperform convolutional ones. This is troublesome for studies that use convolutional networks to explain activity in the visual system. Here we study plausible alternatives to weight sharing that aim at the same regularization principle, which is to make each neuron within a pool react similarly to identical inputs. The most natural way to do that is by showing the network multiple translations of the same image, akin to saccades in animal vision. However, this approach requires many translations, and doesn't remove the performance gap. We propose instead to add lateral connectivity to a locally connected network, and allow learning via Hebbian plasticity. This requires the network to pause occasionally for a sleep-like phase of "weight sharing". This method enables locally connected networks to achieve nearly convolutional performance on ImageNet and improves their fit to the ventral stream data, thus supporting convolutional networks as a model of the visual stream.
    Predictive Maintenance for General Aviation Using Convolutional Transformers. (arXiv:2110.03757v2 [cs.LG] UPDATED)
    (2 min) Predictive maintenance systems have the potential to significantly reduce the cost of maintaining aircraft fleets and to improve safety by detecting maintenance issues before they become severe. However, the development of such systems has been limited by a lack of publicly available labeled multivariate time series (MTS) sensor data. MTS classification has advanced greatly over the past decade, but there is a lack of sufficiently challenging benchmarks for new methods. This work introduces the NGAFID Maintenance Classification (NGAFID-MC) dataset as a novel benchmark in terms of difficulty, number of samples, and sequence length. NGAFID-MC consists of over 7,500 labeled flights, representing over 11,500 hours of per-second flight data recorder readings of 23 sensor parameters. Using this benchmark, we demonstrate that Recurrent Neural Network (RNN) methods are not well suited to capturing temporally distant relationships, and we propose a new architecture called Convolutional Multiheaded Self Attention (Conv-MHSA) that achieves greater classification performance with greater computational efficiency. We also demonstrate that image-inspired augmentations such as cutout, mixup, and cutmix can be used to reduce overfitting and improve generalization in MTS classification. Our best trained models have been incorporated back into the NGAFID to allow users to potentially detect flights that require maintenance and to provide feedback that further expands and refines the NGAFID-MC dataset.
    Diffusion Tensor Estimation with Transformer Neural Networks. (arXiv:2201.05701v1 [eess.IV])
    (2 min) Diffusion tensor imaging (DTI) is the most widely used tool for studying brain white matter development and degeneration. However, standard DTI estimation methods depend on a large number of high-quality measurements. This would require long scan times and can be particularly difficult to achieve with certain patient populations such as neonates. Here, we propose a method that can accurately estimate the diffusion tensor from only six diffusion-weighted measurements. Our method achieves this by learning to exploit the relationships between the diffusion signals and tensors in neighboring voxels. Our model is based on transformer networks, which represent the state of the art in modeling the relationship between signals in a sequence. In particular, our model consists of two such networks. The first network estimates the diffusion tensor based on the diffusion signals in a neighborhood of voxels. The second network provides more accurate tensor estimations by learning the relationships between the diffusion signals as well as the tensors estimated by the first network in neighboring voxels. Our experiments with three datasets show that our proposed method achieves highly accurate estimations of the diffusion tensor and is significantly superior to three competing methods. Estimations produced by our method with six measurements are comparable with those of standard estimation methods with 30-88 measurements. Hence, our method promises shorter scan times and more reliable assessment of brain white matter, particularly in non-cooperative patients such as neonates and infants.
    Inference via Sparse Coding in a Hierarchical Vision Model. (arXiv:2108.01548v3 [cs.CV] UPDATED)
    (2 min) Sparse coding has been incorporated in models of the visual cortex for its computational advantages and connection to biology. But how the level of sparsity contributes to performance on visual tasks is not well understood. In this work, sparse coding has been integrated into an existing hierarchical V2 model (Hosoya and Hyv\"arinen, 2015), replacing its independent component analysis (ICA) with an explicit sparse coding in which the degree of sparsity can be controlled. After training, the sparse coding basis functions with a higher degree of sparsity resembled qualitatively different structures, such as curves and corners. The contributions of the models were assessed with image classification tasks, specifically tasks associated with mid-level vision, including figure-ground classification, texture classification, and angle prediction between two line stimuli. In addition, the models were assessed in comparison to a texture sensitivity measure that has been reported in V2 (Freeman et al., 2013) and a deleted-region inference task. The results show that while sparse coding performed worse than ICA at classifying images, only sparse coding was able to match the texture sensitivity level of V2 and infer deleted image regions, both by increasing the degree of sparsity. Higher degrees of sparsity allowed inference over larger deleted image regions. The mechanism that enables this inference capability in sparse coding is described here.
    Taylor-Lagrange Neural Ordinary Differential Equations: Toward Fast Training and Evaluation of Neural ODEs. (arXiv:2201.05715v1 [cs.LG])
    (2 min) Neural ordinary differential equations (NODEs) -- parametrizations of differential equations using neural networks -- have shown tremendous promise in learning models of unknown continuous-time dynamical systems from data. However, every forward evaluation of a NODE requires numerical integration of the neural network used to capture the system dynamics, making their training prohibitively expensive. Existing works rely on off-the-shelf adaptive step-size numerical integration schemes, which often require an excessive number of evaluations of the underlying dynamics network to obtain sufficient accuracy for training. By contrast, we accelerate the evaluation and the training of NODEs by proposing a data-driven approach to their numerical integration. The proposed Taylor-Lagrange NODEs (TL-NODEs) use a fixed-order Taylor expansion for numerical integration, while also learning to estimate the expansion's approximation error. As a result, the proposed approach achieves the same accuracy as adaptive step-size schemes while employing only low-order Taylor expansions, thus greatly reducing the computational cost necessary to integrate the NODE. A suite of numerical experiments, including modeling dynamical systems, image classification, and density estimation, demonstrate that TL-NODEs can be trained more than an order of magnitude faster than state-of-the-art approaches, without any loss in performance.
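    To illustrate the fixed-order expansion at the core of the method, the sketch below takes a single second-order Taylor step for an autonomous system $\dot{x} = f(x)$, using a Jacobian-vector product for $\frac{d}{dt} f(x(t))$. The learned Lagrange-remainder estimate of TL-NODEs is omitted, and the linear test dynamics are our assumption:

```python
import torch

def taylor2_step(f, x, h):
    # x(t + h) ≈ x + h f(x) + (h^2 / 2) J_f(x) f(x); the JVP term equals
    # d/dt f(x(t)) along the flow, giving an exact 2nd-order expansion.
    fx = f(x)
    _, dfx = torch.autograd.functional.jvp(f, x, fx)
    return x + h * fx + 0.5 * h ** 2 * dfx

# Example: rotation dynamics f(x) = A x, integrated with fixed Taylor steps.
A = torch.tensor([[0.0, 1.0], [-1.0, 0.0]])
f = lambda x: x @ A.T
x = torch.tensor([1.0, 0.0])
for _ in range(100):
    x = taylor2_step(f, x, h=0.05)   # trajectory stays near the unit circle
```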
    Intrusion Prevention through Optimal Stopping. (arXiv:2111.00289v2 [cs.LG] UPDATED)
    (2 min) We study automated intrusion prevention using reinforcement learning. Following a novel approach, we formulate the problem of intrusion prevention as an (optimal) multiple stopping problem. This formulation gives us insight into the structure of optimal policies, which we show to have threshold properties. For most practical cases, it is not feasible to obtain an optimal defender policy using dynamic programming. We therefore develop a reinforcement learning approach to approximate an optimal policy. Our method for learning and validating policies includes two systems: a simulation system where defender policies are incrementally learned and an emulation system where statistics are produced that drive simulation runs and where learned policies are evaluated. We show that our approach can produce effective defender policies for a practical IT infrastructure of limited size. Inspection of the learned policies confirms that they exhibit threshold properties.
    Autoencoders for Semivisible Jet Detection. (arXiv:2112.02864v3 [hep-ph] UPDATED)
    (2 min) The production of dark matter particles from confining dark sectors may lead to many novel experimental signatures. Depending on the details of the theory, dark quark production in proton-proton collisions could result in semivisible jets of particles: collimated sprays of dark hadrons of which only some are detectable by particle collider experiments. The experimental signature is characterised by the presence of reconstructed missing momentum collinear with the visible components of the jets. This complex topology is sensitive to detector inefficiencies and mis-reconstruction that generate artificial missing momentum. With this work, we propose a signal-agnostic strategy to reject ordinary jets and identify semivisible jets via anomaly detection techniques. A deep neural autoencoder network with jet substructure variables as input proves highly useful for analyzing anomalous jets. The study focuses on the semivisible jet signature; however, the technique can apply to any new physics model that predicts signatures with anomalous jets from non-SM particles.
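    A bare-bones version of this strategy, train an autoencoder on ordinary jets and score by reconstruction error, might look as follows; the input variables, network sizes, and stand-in data are illustrative assumptions:

```python
import torch

n_vars = 8                               # jet substructure variables (assumed)
ae = torch.nn.Sequential(
    torch.nn.Linear(n_vars, 4), torch.nn.ReLU(),
    torch.nn.Linear(4, 2),               # bottleneck
    torch.nn.Linear(2, 4), torch.nn.ReLU(),
    torch.nn.Linear(4, n_vars))
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
background = torch.randn(4096, n_vars)   # stand-in for ordinary (SM) jets
for _ in range(500):
    loss = (ae(background) - background).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

def anomaly_score(jets):
    # Jets the autoencoder reconstructs poorly are semivisible-jet candidates.
    with torch.no_grad():
        return (ae(jets) - jets).pow(2).mean(dim=1)
```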
    Molecular Contrastive Learning of Representations via Graph Neural Networks. (arXiv:2102.10056v2 [cs.LG] UPDATED)
    (2 min) Molecular Machine Learning (ML) bears promise for efficient molecule property prediction and drug discovery. However, labeled molecule data can be expensive and time-consuming to acquire. Due to the limited labeled data, it is a great challenge for supervised-learning ML models to generalize to the giant chemical space. In this work, we present MolCLR: Molecular Contrastive Learning of Representations via Graph Neural Networks (GNNs), a self-supervised learning framework that leverages large unlabeled data (~10M unique molecules). In MolCLR pre-training, we build molecule graphs and develop GNN encoders to learn differentiable representations. Three molecule graph augmentations are proposed: atom masking, bond deletion, and subgraph removal. A contrastive estimator maximizes the agreement of augmentations from the same molecule while minimizing the agreement of different molecules. Experiments show that our contrastive learning framework significantly improves the performance of GNNs on various molecular property benchmarks including both classification and regression tasks. Benefiting from pre-training on the large unlabeled database, MolCLR even achieves state-of-the-art on several challenging benchmarks after fine-tuning. Additionally, further investigations demonstrate that MolCLR learns to embed molecules into representations that can distinguish chemically reasonable molecular similarities.
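    The contrastive estimator can be sketched as a standard SimCLR-style NT-Xent loss over paired graph embeddings; the temperature and the exact loss form below are common defaults we assume, not necessarily MolCLR's released code:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.1):
    # z1[i], z2[i] are embeddings of two augmentations (e.g. atom masking
    # vs. bond deletion) of molecule i; all other pairs act as negatives.
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)
    sim = z @ z.T / tau
    n = z1.shape[0]
    sim.fill_diagonal_(-float("inf"))            # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(32, 128), torch.randn(32, 128))
```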
    Missing Data Estimation in Temporal Multilayer Position-aware Graph Neural Network (TMP-GNN). (arXiv:2108.03400v2 [cs.LG] UPDATED)
    (2 min) GNNs have been proven to be highly effective in various node-level, edge-level, and graph-level prediction tasks in several domains. Existing approaches mainly focus on static graphs. However, many graphs change over time: edges may disappear, and node or edge attributes may alter from one time to another. It is essential to consider such evolution in the representation learning of nodes in time-varying graphs. In this paper, we propose a Temporal Multi-layered Position-aware Graph Neural Network (TMP-GNN), a node embedding approach for dynamic graphs that incorporates the interdependence of temporal relations into embedding computation. We evaluate the performance of TMP-GNN on two different representations of temporal multilayered graphs. The performance is assessed against the most popular GNNs on node-level prediction tasks. Then, we incorporate TMP-GNN into a deep learning framework to estimate missing data and compare the performance with the corresponding competent GNNs from our former experiment, and a baseline method. Experimental results on four real-world datasets yield up to 58% lower ROC AUC for the pairwise node classification task and 96% lower MAE in missing feature estimation, particularly for graphs with a relatively high number of nodes and a lower mean degree of connectivity.
    Inverting Adversarially Robust Networks for Image Synthesis. (arXiv:2106.06927v2 [cs.CV] UPDATED)
    (2 min) Despite unconditional feature inverters being the foundation of many synthesis tasks, training them requires a large computational overhead, decoding capacity, or additional autoregressive priors. We propose to train an adversarially robust encoder to learn disentangled and perceptually aligned bottleneck features, making them easily invertible. Then, by training a simple generator with the mirror architecture of the encoder, we achieve superior reconstructions and generalization over standard approaches. We exploit these properties in an encoding-decoding network based on adversarially robust (AR) features and demonstrate its outstanding performance on three applications: anomaly detection, style transfer, and image denoising. Comparisons against alternative learning-based methods show that our model attains improved performance with significantly fewer training parameters.
    Global Regular Network for Writer Identification. (arXiv:2201.05951v1 [cs.CV])
    (0 min) Writer identification has practical applications in forgery detection and forensic science. Most models based on deep neural networks extract features from character images or sub-regions of character images, ignoring features contained in the page-region image. Our proposed global regular network (GRN) pays attention to these features. The GRN consists of two branches: one branch takes page handwriting as input to extract global features, and the other takes word handwriting as input to extract local features. Global and local features merge in a global residual way to form the overall features of the handwriting. The proposed GRN makes two contributions: one is adding a branch to extract features contained in the page; the other is using a residual attention network to extract local features. Experiments demonstrate the effectiveness of both strategies. On the CVL dataset, our models achieve an impressive 99.98% top-1 accuracy and 100% top-5 accuracy with shorter training time and fewer network parameters, exceeding the state-of-the-art structure. The experiments show the powerful ability of the network in the field of writer identification. The source code is available at https://github.com/wangshiyu001/GRN.
    Light Field Networks: Neural Scene Representations with Single-Evaluation Rendering. (arXiv:2106.02634v2 [cs.CV] UPDATED)
    (0 min) Inferring representations of 3D scenes from 2D observations is a fundamental problem of computer graphics, computer vision, and artificial intelligence. Emerging 3D-structured neural scene representations are a promising approach to 3D scene understanding. In this work, we propose a novel neural scene representation, Light Field Networks or LFNs, which represent both geometry and appearance of the underlying 3D scene in a 360-degree, four-dimensional light field parameterized via a neural implicit representation. Rendering a ray from an LFN requires only a single network evaluation, as opposed to hundreds of evaluations per ray for ray-marching or volumetric based renderers in 3D-structured neural scene representations. In the setting of simple scenes, we leverage meta-learning to learn a prior over LFNs that enables multi-view consistent light field reconstruction from as little as a single image observation. This results in dramatic reductions in time and memory complexity, and enables real-time rendering. The cost of storing a 360-degree light field via an LFN is two orders of magnitude lower than conventional methods such as the Lumigraph. Utilizing the analytical differentiability of neural implicit representations and a novel parameterization of light space, we further demonstrate the extraction of sparse depth maps from LFNs.
    Weighting and Pruning based Ensemble Deep Random Vector Functional Link Network for Tabular Data Classification. (arXiv:2201.05809v1 [cs.LG])
    (0 min) In this paper, we first introduce batch normalization to the edRVFL network. This re-normalization method helps the network avoid divergence of the hidden features. We then propose novel variants of the Ensemble Deep Random Vector Functional Link (edRVFL) network. Weighted edRVFL (WedRVFL) uses weighting methods to give training samples different weights in different layers according to how confidently they were classified in the previous layer, thereby increasing the ensemble's diversity and accuracy. Furthermore, a pruning-based edRVFL (PedRVFL) is also proposed. We prune some inferior neurons based on their importance for classification before generating the next hidden layer. Through this method, we ensure that randomly generated inferior features do not propagate to deeper layers. Subsequently, the combination of weighting and pruning, called the Weighting and Pruning based Ensemble Deep Random Vector Functional Link Network (WPedRVFL), is proposed. We compare their performance with other state-of-the-art deep feedforward neural networks (FNNs) on 24 tabular UCI classification datasets. The experimental results illustrate the superior performance of our proposed methods.
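    For readers unfamiliar with RVFL networks, the sketch below shows a single layer: random, fixed hidden weights plus a direct input link, with only the output weights solved in closed form by ridge regression. edRVFL stacks such layers into an ensemble (with the batch normalization, weighting, and pruning described above); the sizes and ridge parameter here are illustrative assumptions:

```python
import numpy as np

def rvfl_fit(X, Y, n_hidden=256, reg=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # random, never trained
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)
    D = np.hstack([H, X])              # direct link: raw inputs pass through
    # Closed-form ridge regression for the output weights only.
    beta = np.linalg.solve(D.T @ D + reg * np.eye(D.shape[1]), D.T @ Y)
    return W, b, beta

def rvfl_predict(X, W, b, beta):
    return np.hstack([np.tanh(X @ W + b), X]) @ beta

# Example with one-hot targets for a 3-class toy problem.
X = np.random.randn(200, 10)
Y = np.eye(3)[np.random.randint(0, 3, 200)]
pred = rvfl_predict(X, *rvfl_fit(X, Y)).argmax(axis=1)
```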
    Transferability in Deep Learning: A Survey. (arXiv:2201.05867v1 [cs.LG])
    (0 min) The success of deep learning algorithms generally depends on large-scale data, while humans appear to have an inherent ability for knowledge transfer, recognizing and applying relevant knowledge from previous learning experiences when encountering and solving unseen tasks. Such an ability to acquire and reuse knowledge is known as transferability in deep learning. It has fueled the long-term quest towards making deep learning as data-efficient as human learning, and has motivated the fruitful design of more powerful deep learning algorithms. We present this survey to connect different isolated areas in deep learning through their relation to transferability, and to provide a unified and complete view of investigating transferability through the whole lifecycle of deep learning. The survey elaborates on the fundamental goals and challenges in parallel with the core principles and methods, covering recent cornerstones in deep architectures, pre-training, task adaptation, and domain adaptation. It highlights unanswered questions on the appropriate objectives for learning transferable knowledge and for adapting the knowledge to new tasks and domains, while avoiding catastrophic forgetting and negative transfer. Finally, we implement a benchmark and an open-source library, enabling a fair evaluation of deep learning methods in terms of transferability.
    Learning Graph Structures with Transformer for Multivariate Time Series Anomaly Detection in IoT. (arXiv:2104.03466v3 [cs.LG] UPDATED)
    (0 min) Many real-world IoT systems, which include a variety of internet-connected sensory devices, produce substantial amounts of multivariate time series data. Meanwhile, vital IoT infrastructures like smart power grids and water distribution networks are frequently targeted by cyber-attacks, making anomaly detection an important study topic. Modeling the relatedness among sensors is nevertheless unavoidable for any efficient and effective anomaly detection system, given the intricate topological and nonlinear connections among sensors that are initially unknown. Furthermore, detecting anomalies in multivariate time series is difficult due to their temporal dependency and stochasticity. This paper presents GTA, a new framework for multivariate time series anomaly detection that involves automatically learning a graph structure, graph convolution, and modeling temporal dependency using a Transformer-based architecture. At the heart of learning the graph structure is a connection learning policy based on the Gumbel-softmax sampling approach, which learns bi-directed links among sensors directly. To describe the anomaly information flow between network nodes, we introduce a new graph convolution called Influence Propagation convolution. In addition, to tackle the quadratic complexity barrier, we propose a multi-branch attention mechanism to replace the original multi-head self-attention method. Extensive experiments on four publicly available anomaly detection benchmarks further demonstrate the superiority of our approach over state-of-the-art alternatives. Code is available at https://github.com/ZEKAICHEN/GTA.
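    The connection-learning step can be pictured with a plain Gumbel-softmax sample, $\mathrm{softmax}((\log\pi + g)/\tau)$ with $g = -\log(-\log U)$. The numpy sketch below is a non-differentiable stand-in for intuition (GTA uses the trick inside an autograd framework), and the temperature is assumed:

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=np.random.default_rng(0)):
    # Gumbel(0, 1) noise: g = -log(-log U), U ~ Uniform(0, 1).
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = np.exp((logits + g) / tau)
    return y / y.sum(axis=-1, keepdims=True)

# Relaxed adjacency over 5 sensors: per directed pair, logits for {off, on}.
logits = np.random.randn(5, 5, 2)
soft_adjacency = gumbel_softmax(logits)[..., 1]  # soft edge weights in (0, 1)
```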
    Improved Reinforcement Learning in Cooperative Multi-agent Environments Using Knowledge Transfer. (arXiv:2107.09807v5 [cs.MA] UPDATED)
    (0 min) Nowadays, cooperative multi-agent systems are used to learn how to achieve goals in large-scale dynamic environments. However, learning in these environments is challenging: the size of the search space affects learning time, and cooperation among agents can be inefficient. Moreover, reinforcement learning algorithms may suffer from long convergence times in such environments. In this paper, a communication framework is introduced. In the proposed framework, agents learn to cooperate effectively, and the introduction of a new state-calculation method considerably reduces the size of the state space. Furthermore, a knowledge-transferring algorithm is presented to share the gained experiences among the different agents, and an effective knowledge-fusing mechanism is developed to fuse the knowledge learnt from the agents' own experiences with the knowledge received from other team members. Finally, simulation results are provided to indicate the efficacy of the proposed method on a complex learning task. We have evaluated our approach on the shepherding problem, and the results show that the learning process accelerates by making use of the knowledge-transferring mechanism, and the size of the state space declines by generating similar states based on the state abstraction concept.
    UDC: Unified DNAS for Compressible TinyML Models. (arXiv:2201.05842v1 [cs.LG])
    (0 min) Emerging Internet-of-Things (IoT) applications are driving the deployment of neural networks (NNs) on heavily constrained low-cost hardware (HW) platforms, where accuracy is typically limited by memory capacity. To address this TinyML challenge, new HW platforms like neural processing units (NPUs) have support for model compression, which exploits aggressive network quantization and unstructured pruning optimizations. The combination of NPUs with HW compression and compressible models allows more expressive models in the same memory footprint. However, adding optimizations for compressibility on top of conventional NN architecture choices expands the design space across which we must make balanced trade-offs. This work bridges the gap between NPU HW capability and NN model design by proposing a neural architecture search (NAS) algorithm to efficiently search a large design space, including: network depth, operator type, layer width, bitwidth, sparsity, and more. Building on differentiable NAS (DNAS) with several key improvements, we demonstrate Unified DNAS for Compressible models (UDC) on CIFAR100, ImageNet, and DIV2K super-resolution tasks. On ImageNet, we find Pareto-dominant compressible models, which are 1.9x smaller or 5.76% more accurate.
    Robust uncertainty estimates with out-of-distribution pseudo-inputs training. (arXiv:2201.05890v1 [cs.LG])
    (0 min) Probabilistic models often use neural networks to control their predictive uncertainty. However, when making out-of-distribution (OOD) predictions, the often uncontrollable extrapolation properties of neural networks yield poor uncertainty predictions. Such models then don't know what they don't know, which directly limits their robustness w.r.t. unexpected inputs. To counter this, we propose to explicitly train the uncertainty predictor where we are not given data, to make it reliable. As one cannot train without data, we provide mechanisms for generating pseudo-inputs in informative low-density regions of the input space, and show how to leverage these in a practical Bayesian framework that casts a prior distribution over the model uncertainty. With a holistic evaluation, we demonstrate that this yields robust and interpretable predictions of uncertainty while retaining state-of-the-art performance on diverse tasks such as regression and generative modelling.
    Transformers in Action: Weakly Supervised Action Segmentation. (arXiv:2201.05675v1 [cs.CV])
    (0 min) The video action segmentation task is regularly explored under weaker forms of supervision, such as transcript supervision, where a list of actions is easier to obtain than dense frame-wise labels. In this formulation, the task presents various challenges for sequence modeling approaches due to the emphasis on action transition points, long sequence lengths, and frame contextualization, making the task well-posed for transformers. Given developments enabling transformers to scale linearly, we demonstrate through our architecture how they can be applied to improve action alignment accuracy over the equivalent RNN-based models, with the attention mechanism focusing on salient action transition regions. Additionally, given the recent focus on inference-time transcript selection, we propose a supplemental transcript embedding approach to select transcripts more quickly at inference time. Furthermore, we demonstrate how this approach can also improve the overall segmentation performance. Finally, we evaluate our proposed methods across the benchmark datasets to better understand the applicability of transformers and the importance of transcript selection on this video-driven weakly-supervised task.
    Moses: Efficient Exploitation of Cross-device Transferable Features for Tensor Program Optimization. (arXiv:2201.05752v1 [cs.LG])
    (0 min) Achieving efficient execution of machine learning models has attracted significant attention recently. To generate tensor programs efficiently, a key component of DNN compilers is the cost model that can predict the performance of each configuration on specific devices. However, due to the rapid emergence of hardware platforms, it is increasingly labor-intensive to train domain-specific predictors for every new platform. Besides, current design of cost models cannot provide transferable features between different hardware accelerators efficiently and effectively. In this paper, we propose Moses, a simple and efficient design based on the lottery ticket hypothesis, which fully takes advantage of the features transferable to the target device via domain adaptation. Compared with state-of-the-art approaches, Moses achieves up to 1.53X efficiency gain in the search stage and 1.41X inference speedup on challenging DNN benchmarks.
    Training Fair Deep Neural Networks by Balancing Influence. (arXiv:2201.05759v1 [cs.LG])
    (0 min) Most fair machine learning methods either highly rely on the sensitive information of the training samples or require a large modification on the target models, which hinders their practical application. To address this issue, we propose a two-stage training algorithm named FAIRIF. It minimizes the loss over the reweighted data set (second stage) where the sample weights are computed to balance the model performance across different demographic groups (first stage). FAIRIF can be applied on a wide range of models trained by stochastic gradient descent without changing the model, while only requiring group annotations on a small validation set to compute sample weights. Theoretically, we show that, in the classification setting, three notions of disparity among different groups can be mitigated by training with the weights. Experiments on synthetic data sets demonstrate that FAIRIF yields models with better fairness-utility trade-offs against various types of bias; and on real-world data sets, we show the effectiveness and scalability of FAIRIF. Moreover, as evidenced by the experiments with pretrained models, FAIRIF is able to alleviate the unfairness issue of pretrained models without hurting their performance.
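    The second stage reduces to ordinary training on a per-sample weighted loss. The sketch below uses simple group-frequency balancing weights for illustration; FAIRIF instead derives the weights from first-stage validation performance, which we omit here:

```python
import torch
import torch.nn.functional as F

def group_balanced_weights(groups):
    # Equalize the total weight mass of each group; mean weight stays 1.
    counts = torch.bincount(groups).float()
    w = (1.0 / counts)[groups]
    return w * len(groups) / len(counts)

logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
groups = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])  # imbalanced demographic groups
w = group_balanced_weights(groups)
loss = (F.cross_entropy(logits, labels, reduction="none") * w).mean()
```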
    Imputing Missing Observations with Time Sliced Synthetic Minority Oversampling Technique. (arXiv:2201.05634v1 [cs.LG])
    (0 min) We present a simple yet novel time series imputation technique with the goal of constructing an irregular time series that is uniform across every sample in a data set. Specifically, we fix a grid defined by the midpoints of non-overlapping bins (dubbed "slices") of observation times and ensure that each sample has values for all of the features at that given time. This allows one both to impute fully missing observations, enabling uniform time series classification across the entire data set, and, in special cases, to impute individually missing features. To do so, we slightly generalize the well-known class imbalance algorithm SMOTE to allow component-wise nearest neighbor interpolation that preserves correlations when there are no missing features. We visualize the method in the simplified setting of 2-dimensional uncoupled harmonic oscillators. Next, we use tSMOTE to train an Encoder/Decoder long short-term memory (LSTM) model with logistic regression for predicting and classifying distinct trajectories of different 2D oscillators. After illustrating the utility of tSMOTE in this context, we use the same architecture to train a clinical model for COVID-19 disease severity on an imputed data set. Our experiments show an improvement over standard mean and median imputation techniques, by allowing a wider class of patient trajectories to be recognized by the model, as well as an improvement over aggregated classification models.
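    The interpolation at the heart of the method is SMOTE's: each synthetic point lies on the segment between a sample and one of its k nearest neighbours. A plain (non-time-sliced) numpy sketch, with sizes chosen for illustration:

```python
import numpy as np

def smote_interpolate(X, k=5, n_new=100, seed=0):
    rng = np.random.default_rng(seed)
    # k nearest neighbours of every sample (column 0 is the sample itself).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    i = rng.integers(0, len(X), n_new)            # random base samples
    j = nn[i, rng.integers(0, k, n_new)]          # random neighbour of each
    lam = rng.uniform(size=(n_new, 1))
    return X[i] + lam * (X[j] - X[i])             # points on the segments

synthetic = smote_interpolate(np.random.randn(50, 4))
```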
    Discrete Simulation Optimization for Tuning Machine Learning Method Hyperparameters. (arXiv:2201.05978v1 [cs.LG])
    (0 min) Machine learning methods are being increasingly used in most technical areas such as image recognition, product recommendation, financial analysis, medical diagnosis, and predictive maintenance. The key question that arises is: how do we control the learning process according to our requirements for the problem? Hyperparameter tuning is used to choose the optimal set of hyperparameters for controlling the learning process of a model. Selecting the appropriate hyperparameters directly impacts the performance measure of a model. We have used simulation optimization with discrete search methods, namely ranking and selection (R&S) methods such as the KN method and the stochastic ruler method and its variations, for hyperparameter optimization, and we have also developed the theoretical basis for applying common R&S methods. The KN method finds the best possible system with a statistical guarantee, while the stochastic ruler method asymptotically converges to the optimal solution and is also computationally very efficient. We also benchmarked our results against state-of-the-art hyperparameter optimization libraries such as hyperopt and mango, and found KN and the stochastic ruler to perform consistently better than hyperopt rand, and the stochastic ruler to be equally efficient in comparison with hyperopt tpe in most cases, even when our computational implementations are not yet optimized in comparison to professional packages.
    Convolutional Signature for Sequential Data. (arXiv:2009.06719v2 [stat.ML] UPDATED)
    (0 min) The signature is an infinite graded sequence of statistics known to characterize geometric rough paths, which include paths of bounded variation. This object has been studied successfully for machine learning, mostly with applications in low-dimensional cases. In the high-dimensional case, it suffers from exponential growth in the number of features of the truncated signature transform. We propose a novel neural-network-based model that borrows ideas from Convolutional Neural Networks to address this problem. Our model reduces the number of features efficiently in a data-dependent way. Empirical experiments are provided to support our model.
    Deep Optimal Transport on SPD Manifolds for Domain Adaptation. (arXiv:2201.05745v1 [cs.LG])
    (0 min) The domain adaptation (DA) problem on symmetric positive definite (SPD) manifolds has raised interest in the machine learning community because of the growing potential of SPD-matrix representations across many non-stationary application scenarios. This paper generalizes joint distribution adaptation (JDA) to align the source and target domains on SPD manifolds, and proposes a deep network architecture, Deep Optimal Transport (DOT), using the generalized JDA and existing deep network architectures on SPD manifolds. The specific architecture of DOT enables it to learn an approximate optimal transport (OT) solution to DA problems on SPD manifolds. In experiments, DOT exhibits 2.32% and 2.92% increases in average accuracy in two highly non-stationary cross-session scenarios in brain-computer interfaces (BCIs), respectively. Visualizations of the source and target domains before and after the transformation also demonstrate the validity of DOT.
    Tools and Practices for Responsible AI Engineering. (arXiv:2201.05647v1 [cs.LG])
    (0 min) Responsible Artificial Intelligence (AI) - the practice of developing, evaluating, and maintaining accurate AI systems that also exhibit essential properties such as robustness and explainability - represents a multifaceted challenge that often stretches standard machine learning tooling, frameworks, and testing methods beyond their limits. In this paper, we present two new software libraries - hydra-zen and the rAI-toolbox - that address critical needs for responsible AI engineering. hydra-zen dramatically simplifies the process of making complex AI applications configurable, and their behaviors reproducible. The rAI-toolbox is designed to enable methods for evaluating and enhancing the robustness of AI-models in a way that is scalable and that composes naturally with other popular ML frameworks. We describe the design principles and methodologies that make these tools effective, including the use of property-based testing to bolster the reliability of the tools themselves. Finally, we demonstrate the composability and flexibility of the tools by showing how various use cases from adversarial robustness and explainable AI can be concisely implemented with familiar APIs.
    Zero-Shot Machine Unlearning. (arXiv:2201.05629v1 [cs.LG])
    (0 min) With the introduction of new privacy regulations, machine unlearning is becoming an emerging research problem due to an increasing need for regulatory compliance in machine learning (ML) applications. Modern privacy regulations grant citizens the right to be forgotten by products, services and companies. This necessitates deletion of data not only from storage archives but also from the ML model. Right-to-be-forgotten requests come in the form of removal of a certain set or class of data from an already trained ML model. Practical considerations preclude retraining the model from scratch minus the deleted data. The few existing studies use the whole training data, or a subset of training data, or some metadata stored during training to update the model weights for unlearning. However, strict regulatory compliance requires time-bound deletion of data. Thus, in many cases, no data related to the training process or training samples may be accessible even for the unlearning purpose. We therefore ask the question: is it possible to achieve unlearning with zero training samples? In this paper, we introduce the novel problem of zero-shot machine unlearning that caters for the extreme but practical scenario where zero original data samples are available for use. We then propose two novel solutions for zero-shot machine unlearning based on (a) error minimizing-maximizing noise and (b) gated knowledge transfer. We also introduce a new evaluation metric, the Anamnesis Index (AIN), to effectively measure the quality of the unlearning method. The experiments show promising results for unlearning in deep learning models on benchmark vision datasets. The source code will be made publicly available.
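    A bare-bones sketch of one ingredient of solution (a), assuming a PyTorch classifier: learn a noise batch that maximizes the model's error signal for the class to be forgotten, with no access to training samples. How the paper pairs minimizing and maximizing noise, and how the model is then updated, is more involved; this only illustrates the data-free flavor of the approach.

```python
import torch
import torch.nn.functional as F

def error_maximizing_noise(model, forget_class, shape, steps=200, lr=0.05):
    """Synthesize an input batch on which the model's loss for the
    forget class is maximal; no original training samples are touched.
    Fine-tuning on such batches is one way to degrade class knowledge."""
    noise = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([noise], lr=lr)
    labels = torch.full((shape[0],), forget_class, dtype=torch.long)
    model.eval()
    for _ in range(steps):
        opt.zero_grad()
        loss = -F.cross_entropy(model(noise), labels)  # ascend the error
        loss.backward()
        opt.step()
    return noise.detach()
```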
    Smart Parking Space Detection under Hazy conditions using Convolutional Neural Networks: A Novel Approach. (arXiv:2201.05858v1 [cs.CV])
    (0 min) Limited urban parking space combined with urbanization has necessitated the development of smart parking systems that can communicate the availability of parking slots to end users. Toward this, various deep learning based solutions using convolutional neural networks have been proposed for parking space occupation detection. Though these approaches are robust to partial obstructions and lighting conditions, their performance is found to degrade in the presence of haze. In this direction, this paper investigates the use of dehazing networks that improve the performance of parking space occupancy classifiers under hazy conditions. Additionally, training procedures are proposed for the dehazing networks to maximize the performance of the system under both hazy and non-hazy conditions. The proposed system is deployable as part of existing smart parking systems, where a limited number of cameras is used to monitor hundreds of parking spaces. To validate our approach, we developed a custom hazy parking system dataset from the real-world task-driven test set of the RESIDE-β dataset. The proposed approach is tested against existing state-of-the-art parking space detectors on the CNRPark-EXT and hazy parking system datasets. Experimental results indicate a significant accuracy improvement of the proposed approach on the hazy parking system dataset.
    GearNet: Stepwise Dual Learning for Weakly Supervised Domain Adaptation. (arXiv:2201.06001v1 [cs.LG])
    (0 min) This paper studies the weakly supervised domain adaptation (WSDA) problem, where we only have access to a source domain with noisy labels, from which we need to transfer useful information to an unlabeled target domain. Although there have been a few studies on this problem, most of them only exploit unidirectional relationships from the source domain to the target domain. In this paper, we propose a universal paradigm called GearNet to exploit bilateral relationships between the two domains. Specifically, we take the two domains as different inputs to train two models alternately, and an asymmetrical Kullback-Leibler loss is used for selectively matching the predictions of the two models in the same domain. This interactive learning schema enables implicit label-noise canceling and exploits correlations between the source and target domains. Therefore, GearNet has great potential to boost the performance of a wide range of existing WSDA methods. Comprehensive experimental results show that the performance of existing methods can be significantly improved by equipping them with our GearNet.
    Deep learning of multi-resolution X-Ray micro-CT images for multi-scale modelling. (arXiv:2111.01270v2 [physics.geo-ph] UPDATED)
    (0 min) Field-of-view and resolution trade-offs in X-Ray micro-computed tomography (micro-CT) imaging limit the characterization, analysis and model development of multi-scale porous systems. To this end, we developed an applied methodology utilising deep learning to enhance low resolution (LR) images over large sample sizes and create multi-scale models capable of accurately simulating experimental fluid dynamics from the pore (microns) to continuum (centimetres) scale. We develop a 3D Enhanced Deep Super Resolution (EDSR) convolutional neural network to create super resolution (SR) images from low resolution images, which alleviates common micro-CT hardware/reconstruction defects in high-resolution (HR) images. When paired with pore-network simulations and parallel computation, we can create large 3D continuum-scale models with spatially varying flow & material properties. We quantitatively validate the workflow at various scales using direct HR/SR image similarity, pore-scale material/flow simulations and continuum scale multiphase flow experiments (drainage immiscible flow pressures and 3D fluid volume fractions). The SR images and models are comparable to the HR ground truth, and generally accurate to within experimental uncertainty at the continuum scale across a range of flow rates. They are found to be significantly more accurate than their LR counterparts, especially in cases where a wide distribution of pore sizes is encountered. The applied methodology opens up the possibility to image, model and analyse truly multi-scale heterogeneous systems that are otherwise intractable.
    Variance-Reduced Heterogeneous Federated Learning via Stratified Client Selection. (arXiv:2201.05762v1 [cs.LG])
    (0 min) Client selection strategies are widely adopted to address communication efficiency in recent studies of Federated Learning (FL). However, due to the large variance of the selected subset's update, prior selection approaches with a limited sampling ratio cannot perform well on convergence and accuracy in heterogeneous FL. To address this problem, we propose a novel stratified client selection scheme that reduces this variance in pursuit of better convergence and higher accuracy. Specifically, to mitigate the impact of heterogeneity, we develop stratification based on clients' local data distributions to derive approximately homogeneous strata for better selection within each stratum. Concentrating on the limited-sampling-ratio scenario, we next present an optimized sample size allocation scheme that accounts for the diversity of each stratum's variability, with the promise of further variance reduction. Theoretically, we elaborate the explicit relation among different selection schemes with regard to variance and, under heterogeneous settings, demonstrate the effectiveness of our selection scheme. Experimental results confirm that our approach not only allows for better performance relative to state-of-the-art methods but is also compatible with prevalent FL algorithms.
    Hyperplane bounds for neural feature mappings. (arXiv:2201.05799v1 [cs.LG])
    (0 min) Deep learning methods minimise the empirical risk using loss functions such as the cross entropy loss. When minimising the empirical risk, the generalisation of the learnt function still depends on the performance on the training data, the Vapnik-Chervonenkis (VC) dimension of the function, and the number of training examples. Neural networks have a large number of parameters, which correlates with their VC dimension (typically large but not infinite), and a large number of training instances is typically needed to train them effectively. In this work, we explore how to optimize feature mappings using a neural network with the intention of reducing the effective VC dimension of the hyperplane found in the space generated by the mapping. One interpretation of the results of this study is that it is possible to define a loss that controls the VC dimension of the separating hyperplane. We evaluate this approach and observe that it improves performance when the size of the training set is small.
    Graph Transfer Learning via Adversarial Domain Adaptation with Graph Convolution. (arXiv:1909.01541v2 [cs.LG] UPDATED)
    (0 min) This paper studies the problem of cross-network node classification to overcome the insufficiency of labeled data in a single network. It aims to leverage the label information in a partially labeled source network to assist node classification in a completely unlabeled or partially labeled target network. Existing methods for single network learning cannot solve this problem due to the domain shift across networks. Some multi-network learning methods heavily rely on the existence of cross-network connections and are thus inapplicable to this problem. To tackle this problem, we propose a novel graph transfer learning framework, AdaGCN, by leveraging the techniques of adversarial domain adaptation and graph convolution. It consists of two components: a semi-supervised learning component and an adversarial domain adaptation component. The former aims to learn class discriminative node representations with given label information of the source and target networks, while the latter contributes to mitigating the distribution divergence between the source and target domains to facilitate knowledge transfer. Extensive empirical evaluations on real-world datasets show that AdaGCN can successfully transfer class information even with a low label rate on the source network and a substantial divergence between the source and target domains.
    Cautious Policy Programming: Exploiting KL Regularization in Monotonic Policy Improvement for Reinforcement Learning. (arXiv:2107.05798v3 [cs.LG] UPDATED)
    (0 min) In this paper, we propose cautious policy programming (CPP), a novel value-based reinforcement learning (RL) algorithm that can ensure monotonic policy improvement during learning. Based on the nature of entropy-regularized RL, we derive a new entropy regularization-aware lower bound of policy improvement that only requires estimating the expected policy advantage function. CPP leverages this lower bound as a criterion for adjusting the degree of a policy update for alleviating policy oscillation. Different from similar algorithms that are mostly theory-oriented, we also propose a novel interpolation scheme that makes CPP better scale in high dimensional control problems. We demonstrate that the proposed algorithm can trade off performance and stability in both didactic classic control problems and challenging high-dimensional Atari games.
    An efficient aggregation method for the symbolic representation of temporal data. (arXiv:2201.05697v1 [cs.LG])
    (0 min) Symbolic representations are a useful tool for the dimension reduction of temporal data, allowing for the efficient storage of and information retrieval from time series. They can also enhance the training of machine learning algorithms on time series data through noise reduction and reduced sensitivity to hyperparameters. The adaptive Brownian bridge-based aggregation (ABBA) method is one such effective and robust symbolic representation, demonstrated to accurately capture important trends and shapes in time series. However, in its current form the method struggles to process very large time series. Here we present a new variant of the ABBA method, called fABBA. This variant utilizes a new aggregation approach tailored to the piecewise representation of time series. By replacing the k-means clustering used in ABBA with a sorting-based aggregation technique, and thereby avoiding repeated sum-of-squares error computations, the computational complexity is significantly reduced. In contrast to the original method, the new approach does not require the number of time series symbols to be specified in advance. Through extensive tests we demonstrate that the new method significantly outperforms ABBA with a considerable reduction in runtime while also outperforming the popular SAX and 1d-SAX representations in terms of reconstruction accuracy. We further demonstrate that fABBA can compress other data types such as images.
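    A toy sketch of the two-stage pipeline: greedy piecewise-linear compression into (length, increment) pairs, then a single sorted sweep that groups pieces without repeated sum-of-squares passes. The tolerances and the greedy grouping rule below are illustrative assumptions, not the published fABBA procedure.

```python
import numpy as np

def compress(ts, tol=0.5):
    """Greedy piecewise-linear compression: emit a (length, increment)
    pair whenever the straight-line fit error exceeds tol."""
    pieces, start = [], 0
    for end in range(2, len(ts) + 1):
        x = np.arange(end - start)
        line = ts[start] + (ts[end - 1] - ts[start]) / (end - 1 - start) * x
        if np.max(np.abs(ts[start:end] - line)) > tol:
            pieces.append((end - 1 - start, ts[end - 1] - ts[start]))
            start = end - 1
    if start < len(ts) - 1:
        pieces.append((len(ts) - 1 - start, ts[-1] - ts[start]))
    return np.array(pieces, dtype=float)

def aggregate(pieces, alpha=1.0):
    """Sorting-based aggregation: sweep the pieces in order of norm and
    open a new group whenever the next piece is farther than alpha from
    the group's first piece; no repeated sum-of-squares passes and no
    preset number of symbols."""
    order = np.argsort(np.linalg.norm(pieces, axis=1))
    labels = np.empty(len(pieces), dtype=int)
    group, anchor = -1, None
    for idx in order:
        if anchor is None or np.linalg.norm(pieces[idx] - anchor) > alpha:
            group, anchor = group + 1, pieces[idx]
        labels[idx] = group
    return labels
```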
    Deep Reinforcement Learning for Shared Autonomous Vehicles (SAV) Fleet Management. (arXiv:2201.05720v1 [cs.LG])
    (0 min) Shared Automated Vehicle (SAV) fleet companies are starting pilot projects nationwide. In 2020, the first shared autonomous vehicle fleet pilot project in Virginia was announced in Fairfax. SAVs promise to improve quality of life, but they will also induce negative externalities by generating excessive vehicle miles traveled (VMT), which leads to more congestion, energy consumption, and emissions. The excessive VMT are primarily generated via the empty relocation process. Reinforcement learning based algorithms are being researched as a possible solution to some of these problems, most notably minimizing waiting time for riders, but no reinforcement learning research has addressed reducing parking space costs or empty cruising time. This study explores different reinforcement learning approaches and then decides on the best approach to help minimize rider waiting time, parking cost, and empty travel time.
    Formula graph self-attention network for representation-domain independent materials discovery. (arXiv:2201.05649v1 [cs.LG])
    (0 min) The success of machine learning (ML) in materials property prediction depends heavily on how the materials are represented for learning. Two dominant families of material descriptors exist, one that encodes crystal structure in the representation and the other that only uses stoichiometric information with the hope of discovering new materials. Graph neural networks (GNNs) in particular have excelled in predicting material properties within chemical accuracy. However, current GNNs are limited to only one of the above two avenues owing to the little overlap between respective material representations. Here, we introduce a new concept of formula graph which unifies both stoichiometry-only and structure-based material descriptors. We further develop a self-attention integrated GNN that assimilates a formula graph and show that the proposed architecture produces material embeddings transferable between the two domains. Our model substantially outperforms previous structure-based GNNs as well as structure-agnostic counterparts while exhibiting better sample efficiency and faster convergence. Finally, the model is applied in a challenging exemplar to predict the complex dielectric function of materials and nominate new substances that potentially exhibit epsilon-near-zero phenomena.
    SGLB: Stochastic Gradient Langevin Boosting. (arXiv:2001.07248v5 [cs.LG] UPDATED)
    (0 min) This paper introduces Stochastic Gradient Langevin Boosting (SGLB) - a powerful and efficient machine learning framework that may deal with a wide range of loss functions and has provable generalization guarantees. The method is based on a special form of the Langevin diffusion equation specifically designed for gradient boosting. This allows us to theoretically guarantee the global convergence even for multimodal loss functions, while standard gradient boosting algorithms can guarantee only local optimum. We also empirically show that SGLB outperforms classic gradient boosting when applied to classification tasks with 0-1 loss function, which is known to be multimodal.
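    A hedged illustration of the Langevin idea grafted onto a generic squared-loss boosting loop; the shrinkage, noise scale, and scikit-learn stumps are my assumptions, and the paper ties the diffusion to gradient boosting more carefully than this.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def langevin_boost(X, y, n_rounds=200, lr=0.1, beta=1e3, seed=0):
    """Gradient boosting for squared loss where each round fits a tree
    to the *noisy* negative gradient; the Gaussian perturbation plays
    the role of the Langevin diffusion term that helps the ensemble
    escape local optima of multimodal losses."""
    rng = np.random.default_rng(seed)
    F, trees = np.zeros(len(y)), []
    for _ in range(n_rounds):
        residual = y - F                     # negative gradient of squared loss
        noisy = residual + rng.normal(0, np.sqrt(2 * lr / beta), size=len(y))
        tree = DecisionTreeRegressor(max_depth=3).fit(X, noisy)
        F += lr * tree.predict(X)
        trees.append(tree)
    return trees
```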
    Real-World Graph Convolution Networks (RW-GCNs) for Action Recognition in Smart Video Surveillance. (arXiv:2201.05739v1 [cs.CV])
    (0 min) Action recognition is a key algorithmic part of emerging on-the-edge smart video surveillance and security systems. Skeleton-based action recognition is an attractive approach which, instead of using RGB pixel data, relies on human pose information to classify appropriate actions. However, existing algorithms often assume ideal conditions that are not representative of real-world limitations, such as noisy input, latency requirements, and edge resource constraints. To address the limitations of existing approaches, this paper presents Real-World Graph Convolution Networks (RW-GCNs), an architecture-level solution for meeting the domain constraints of real-world skeleton-based action recognition. Inspired by the presence of feedback connections in the human visual cortex, RW-GCNs leverage attentive feedback augmentation on existing near state-of-the-art (SotA) Spatial-Temporal Graph Convolution Networks (ST-GCNs). The ST-GCNs' design choices are derived from information-theoretic principles to address both the spatial and temporal noise typically encountered in end-to-end real-time and on-the-edge smart video systems. Our results demonstrate RW-GCNs' ability to serve these applications by achieving a new SotA accuracy of 94.1% on the NTU-RGB-D-120 dataset, and achieving 32X lower latency than baseline ST-GCN applications while still achieving 90.4% accuracy on the Northwestern UCLA dataset in the presence of spatial keypoint noise. RW-GCNs further show system scalability by running on the 10X more cost-effective NVIDIA Jetson Nano (as opposed to the NVIDIA Xavier NX), while still maintaining a respectable throughput (15.6 to 5.5 actions per second) on the resource-constrained device. The code is available here: https://github.com/TeCSAR-UNCC/RW-GCN.
    Recursive Least Squares Advantage Actor-Critic Algorithms. (arXiv:2201.05918v1 [cs.LG])
    (0 min) As an important algorithm in deep reinforcement learning, advantage actor critic (A2C) has been widely successful in both discrete and continuous control tasks with raw pixel inputs, but its sample efficiency still needs improvement. In traditional reinforcement learning, actor-critic algorithms generally use recursive least squares (RLS) technology to update the parameters of linear function approximators and thereby accelerate convergence. However, A2C algorithms seldom use this technology to train deep neural networks (DNNs) and improve their sample efficiency. In this paper, we propose two novel RLS-based A2C algorithms and investigate their performance. Both proposed algorithms, called RLSSA2C and RLSNA2C, use the RLS method to train the critic network and the hidden layers of the actor network. The main difference between them is at the policy learning step. RLSSA2C uses an ordinary first-order gradient descent algorithm and the standard policy gradient to learn the policy parameter. RLSNA2C uses the Kronecker-factored approximation, the RLS method, and the natural policy gradient to learn the compatible parameter and the policy parameter. In addition, we analyze the complexity and convergence of both algorithms and present three tricks for further improving their convergence speed. Finally, we demonstrate the effectiveness of both algorithms on 40 games in the Atari 2600 environment and 11 tasks in the MuJoCo environment. The experimental results show that both of our algorithms have better sample efficiency than vanilla A2C on most games or tasks, and higher computational efficiency than two other state-of-the-art algorithms.
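    For readers unfamiliar with RLS, a compact sketch of the textbook recursive update for a linear function approximator, which is the building block the abstract refers to; the forgetting factor and initialization below are conventional choices, not the paper's DNN-layer variant.

```python
import numpy as np

class RLS:
    """Recursive least squares with forgetting factor lam: maintains
    the weight vector w and the inverse 'precision' matrix P without
    ever re-solving the full normal equations."""
    def __init__(self, dim, lam=0.99, delta=1.0):
        self.w = np.zeros(dim)
        self.P = np.eye(dim) / delta
        self.lam = lam

    def update(self, x, target):
        Px = self.P @ x
        gain = Px / (self.lam + x @ Px)          # Kalman-style gain vector
        self.w += gain * (target - self.w @ x)   # correct by the prediction error
        self.P = (self.P - np.outer(gain, Px)) / self.lam
        return self.w
```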
    Robust Learning-based Predictive Control for Discrete-time Nonlinear Systems with Unknown Dynamics and State Constraints. (arXiv:1911.09827v4 [eess.SY] UPDATED)
    (0 min) Robust model predictive control (MPC) is a well-known control technique for model-based control with constraints and uncertainties. In classic robust tube-based MPC approaches, an open-loop control sequence is computed via periodically solving an online nominal MPC problem, which requires prior model information and frequent access to onboard computational resources. In this paper, we propose an efficient robust MPC solution based on receding horizon reinforcement learning, called r-LPC, for unknown nonlinear systems with state constraints and disturbances. The proposed r-LPC utilizes a Koopman operator-based prediction model obtained off-line from pre-collected input-output datasets. Unlike classic tube-based MPC, in each prediction time interval of r-LPC, we use an actor-critic structure to learn a near-optimal feedback control policy rather than a control sequence. The resulting closed-loop control policy can be learned off-line and deployed online, or learned online in an asynchronous way. In the latter case, online learning can be activated whenever necessary, for instance when the safety constraint is violated by the deployed policy. The closed-loop recursive feasibility, robustness, and asymptotic stability are proven under function approximation errors of the actor-critic networks. Simulation and experimental results on two nonlinear systems with unknown dynamics and disturbances have demonstrated that our approach achieves better or comparable performance compared with tube-based MPC and LQR, and outperforms a recently developed actor-critic learning approach.
    ATOM3D: Tasks On Molecules in Three Dimensions. (arXiv:2012.04035v4 [cs.LG] UPDATED)
    (0 min) Computational methods that operate on three-dimensional molecular structure have the potential to solve important questions in biology and chemistry. In particular, deep neural networks have gained significant attention, but their widespread adoption in the biomolecular domain has been limited by a lack of either systematic performance benchmarks or a unified toolkit for interacting with molecular data. To address this, we present ATOM3D, a collection of both novel and existing benchmark datasets spanning several key classes of biomolecules. We implement several classes of three-dimensional molecular learning methods for each of these tasks and show that they consistently improve performance relative to methods based on one- and two-dimensional representations. The specific choice of architecture proves to be critical for performance, with three-dimensional convolutional networks excelling at tasks involving complex geometries, graph networks performing well on systems requiring detailed positional information, and the more recently developed equivariant networks showing significant promise. Our results indicate that many molecular problems stand to gain from three-dimensional molecular learning, and that there is potential for improvement on many tasks which remain underexplored. To lower the barrier to entry and facilitate further developments in the field, we also provide a comprehensive suite of tools for dataset processing, model training, and evaluation in our open-source atom3d Python package. All datasets are available for download from https://www.atom3d.ai .
    Exposing Query Identification for Search Transparency. (arXiv:2110.07701v2 [cs.IR] UPDATED)
    (0 min) Search systems control the exposure of ranked content to searchers. In many cases, creators value not only the exposure of their content but, moreover, an understanding of the specific searches where the content is surfaced. The problem of identifying which queries expose a given piece of content in the ranking results is an important and relatively under-explored search transparency challenge. Exposing queries are useful for quantifying various issues of search bias, privacy, data protection, security, and search engine optimization. Exact identification of exposing queries in a given system is computationally expensive, especially in dynamic contexts such as web search. We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the role of queries and documents in two classes of search systems: dense dual-encoder models and traditional BM25 models. We then propose how this approach can be improved through metric learning over the retrieval embedding space. We further derive an evaluation metric to measure the quality of a ranking of exposing queries and conduct an empirical analysis focusing on various practical aspects of approximate EQI. Overall, our work contributes a novel conception of transparency in search systems and computational means of achieving it.
    HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments. (arXiv:2111.10635v2 [cs.DC] UPDATED)
    (0 min) Deep neural networks (DNNs) exploit many layers and a large number of parameters to achieve excellent performance. The training process of DNN models generally handles large-scale input data with many sparse features, which incurs high Input/Output (IO) cost, while some layers are compute-intensive. The training process generally exploits distributed computing resources to reduce training time. In addition, heterogeneous computing resources, e.g., CPUs, GPUs of multiple types, are available for the distributed training process. Thus, the scheduling of multiple layers to diverse computing resources is critical for the training process. To efficiently train a DNN model using the heterogeneous computing resources, we propose a distributed framework, i.e., Paddle-Heterogeneous Parameter Server (Paddle-HeterPS), composed of a distributed architecture and a Reinforcement Learning (RL)-based scheduling method. The advantages of Paddle-HeterPS are three-fold compared with existing frameworks. First, Paddle-HeterPS enables efficient training process of diverse workloads with heterogeneous computing resources. Second, Paddle-HeterPS exploits an RL-based method to efficiently schedule the workload of each layer to appropriate computing resources to minimize the cost while satisfying throughput constraints. Third, Paddle-HeterPS manages data storage and data communication among distributed computing resources. We carry out extensive experiments to show that Paddle-HeterPS significantly outperforms state-of-the-art approaches in terms of throughput (14.5 times higher) and monetary cost (312.3% smaller). The codes of the framework are publicly available at: https://github.com/PaddlePaddle/Paddle.
    Differentially Private (Gradient) Expectation Maximization Algorithm with Statistical Guarantees. (arXiv:2010.13520v3 [cs.LG] UPDATED)
    (0 min) (Gradient) Expectation Maximization (EM) is a widely used algorithm for estimating the maximum likelihood of mixture models or incomplete data problems. A major challenge facing this popular technique is how to effectively preserve the privacy of sensitive data. Previous research on this problem has already led to the discovery of some Differentially Private (DP) algorithms for (Gradient) EM. However, unlike in the non-private case, existing techniques are not yet able to provide finite sample statistical guarantees. To address this issue, we propose in this paper the first DP version of the (Gradient) EM algorithm with statistical guarantees. Moreover, we apply our general framework to three canonical models: Gaussian Mixture Model (GMM), Mixture of Regressions Model (MRM) and Linear Regression with Missing Covariates (RMC). Specifically, for GMM in the DP model, our estimation error is near optimal in some cases. For the other two models, we provide the first finite sample statistical guarantees. Our theory is supported by thorough numerical experiments.
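    The standard way such DP guarantees are obtained in gradient-style algorithms is the Gaussian mechanism on clipped per-sample gradients; a generic sketch follows, where the clipping norm C and noise multiplier sigma are the usual DP knobs, not values from the paper.

```python
import numpy as np

def private_gradient(per_sample_grads, C=1.0, sigma=1.0, seed=0):
    """Clip each sample's gradient to norm C, average, then add
    Gaussian noise calibrated to the clipped sensitivity C/n, as in
    the Gaussian mechanism underlying DP gradient methods."""
    rng = np.random.default_rng(seed)
    n = len(per_sample_grads)
    clipped = [g * min(1.0, C / (np.linalg.norm(g) + 1e-12))
               for g in per_sample_grads]
    mean = np.mean(clipped, axis=0)
    return mean + rng.normal(0, sigma * C / n, size=mean.shape)
```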
    Dataset Distillation with Infinitely Wide Convolutional Networks. (arXiv:2107.13034v3 [cs.LG] UPDATED)
    (0 min) The effectiveness of machine learning algorithms arises from being able to extract useful features from large amounts of data. As model and dataset sizes increase, dataset distillation methods that compress large datasets into significantly smaller yet highly performant ones will become valuable in terms of training efficiency and useful feature extraction. To that end, we apply a novel distributed kernel based meta-learning framework to achieve state-of-the-art results for dataset distillation using infinitely wide convolutional neural networks. For instance, using only 10 datapoints (0.02% of original dataset), we obtain over 65% test accuracy on CIFAR-10 image classification task, a dramatic improvement over the previous best test accuracy of 40%. Our state-of-the-art results extend across many other settings for MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, and SVHN. Furthermore, we perform some preliminary analyses of our distilled datasets to shed light on how they differ from naturally occurring data.
    Large-Scale Inventory Optimization: A Recurrent-Neural-Networks-Inspired Simulation Approach. (arXiv:2201.05868v1 [math.OC])
    (0 min) Many large-scale production networks include thousands of types of final products and tens to hundreds of thousands of types of raw materials and intermediate products. These networks face inventory management decisions that are often too complicated for inventory models and too large for simulation models. In this paper, by combining the efficient computational tools of recurrent neural networks (RNNs) with the structural information of production networks, we propose an RNN-inspired simulation approach that may be thousands of times faster than existing simulation approaches and is capable of solving large-scale inventory optimization problems in a reasonable amount of time.
    Attention Approximates Sparse Distributed Memory. (arXiv:2111.05498v2 [cs.LG] UPDATED)
    (0 min) While Attention has come to be an important mechanism in deep learning, there remains limited intuition for why it works so well. Here, we show that Transformer Attention can be closely related under certain data conditions to Kanerva's Sparse Distributed Memory (SDM), a biologically plausible associative memory model. We confirm that these conditions are satisfied in pre-trained GPT2 Transformer models. We discuss the implications of the Attention-SDM map and provide new computational and biological interpretations of Attention.
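    For reference, the attention map in question is

    $$\mathrm{Attention}(Q,K,V)=\operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

    and it is the softmax's exponential weighting of query-key similarity that the authors relate, under the stated data conditions, to the pattern of nearby addresses read out in SDM.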
    Self-Supervised GANs with Label Augmentation. (arXiv:2106.08601v5 [cs.LG] UPDATED)
    (0 min) Recently, transformation-based self-supervised learning has been applied to generative adversarial networks (GANs) to mitigate catastrophic forgetting in the discriminator by introducing a stationary learning environment. However, the separate self-supervised tasks in existing self-supervised GANs cause a goal inconsistent with generative modeling because their self-supervised classifiers are agnostic to the generator distribution. To address this problem, we propose a novel self-supervised GAN that unifies the GAN task with the self-supervised task by augmenting the GAN labels (real or fake) via self-supervision of data transformation. Specifically, the original discriminator and self-supervised classifier are unified into a label-augmented discriminator that predicts the augmented labels to be aware of both the generator distribution and the data distribution under every transformation, and then provides the discrepancy between them to optimize the generator. Theoretically, we prove that the optimal generator could converge to replicate the real data distribution. Empirically, we show that the proposed method significantly outperforms previous self-supervised and data augmentation GANs on both generative modeling and representation learning across benchmark datasets.
    A Novel Multi-Task Learning Method for Symbolic Music Emotion Recognition. (arXiv:2201.05782v1 [cs.SD])
    (0 min) Symbolic Music Emotion Recognition (SMER) is the task of predicting music emotion from symbolic data, such as MIDI and MusicXML. Previous work mainly focused on learning better representations via (masked) language model pre-training but ignored the intrinsic structure of the music, which is extremely important to its emotional expression. In this paper, we present a simple multi-task framework for SMER, which incorporates the emotion recognition task with other emotion-related auxiliary tasks derived from the intrinsic structure of the music. The results show that our multi-task framework can be adapted to different models. Moreover, the labels of the auxiliary tasks are easy to obtain, which means our multi-task methods do not require manually annotated labels other than emotion. Experiments on two publicly available datasets (EMOPIA and VGMIDI) show that our methods perform better on the SMER task. Specifically, accuracy increases by 4.17 absolute points to 67.58 on the EMOPIA dataset and by 1.97 absolute points to 55.85 on the VGMIDI dataset. Ablation studies also show the effectiveness of the multi-task methods designed in this paper.
    Incorporating Multiple Cluster Centers for Multi-Label Learning. (arXiv:2004.08113v3 [cs.LG] UPDATED)
    (0 min) Multi-label learning deals with the problem that each instance is associated with multiple labels simultaneously. Most of the existing approaches aim to improve the performance of multi-label learning by exploiting label correlations. Although the data augmentation technique is widely used in many machine learning tasks, it is still unclear whether data augmentation is helpful to multi-label learning. In this article, we propose to leverage the data augmentation technique to improve the performance of multi-label learning. Specifically, we first propose a novel data augmentation approach that performs clustering on the real examples and treats the cluster centers as virtual examples, and these virtual examples naturally embody the local label correlations and label importances. Then, motivated by the cluster assumption that examples in the same cluster should have the same label, we propose a novel regularization term to bridge the gap between the real examples and virtual examples, which can promote the local smoothness of the learning function. Extensive experimental results on a number of real-world multi-label datasets clearly demonstrate that our proposed approach outperforms the state-of-the-art counterparts.
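    A minimal sketch of the described augmentation, assuming scikit-learn's KMeans and a binary label matrix; the choice of k and the soft label averaging are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def virtual_examples(X, Y, k=10, seed=0):
    """X: (n, d) features; Y: (n, L) binary multi-label matrix.
    Cluster the real examples and pair each center with its members'
    mean label vector; the center then acts as a virtual example that
    encodes the local label correlations and label importances."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(X)
    centers = km.cluster_centers_
    soft_labels = np.vstack([Y[km.labels_ == c].mean(axis=0)
                             for c in range(k)])
    return centers, soft_labels
```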
    Enhanced Membership Inference Attacks against Machine Learning Models. (arXiv:2111.09679v2 [cs.LG] UPDATED)
    (0 min) How much does a given trained model leak about each individual data record in its training set? Membership inference attacks are used as an auditing tool to quantify the private information that a model leaks about the individual data points in its training set. Membership inference attacks are influenced by different uncertainties that an attacker has to resolve about training data, the training algorithm, and the underlying data distribution. Thus attack success rates, of many attacks in the literature, do not precisely capture the information leakage of models about their data, as they also reflect other uncertainties that the attack algorithm has. In this paper, we explain the implicit assumptions and also the simplifications made in prior work using the framework of hypothesis testing. We also derive new attack algorithms from the framework that can achieve a high AUC score while also highlighting the different factors that affect their performance. Our algorithms capture a very precise approximation of privacy loss in models, and can be used as a tool to perform an accurate and informed estimation of privacy risk in machine learning models. We provide a thorough empirical evaluation of our attack strategies on various machine learning tasks and benchmark datasets.
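    In the hypothesis-testing framing, an attack boils down to a per-record test statistic calibrated against an "out" world; a bare-bones sketch using shadow-model losses and a Gaussian fit, which is a simplification of the calibrated tests the paper studies.

```python
import numpy as np
from scipy.stats import norm

def membership_score(target_loss, shadow_out_losses):
    """Score how anomalous the target model's loss on a record is
    relative to losses from shadow models that never trained on it;
    a loss well below the 'out' distribution suggests membership."""
    mu = np.mean(shadow_out_losses)
    sd = np.std(shadow_out_losses) + 1e-12
    return norm.cdf((mu - target_loss) / sd)   # high score = likely member
```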
    Mixed Diagnostics for Longitudinal Properties of Electron Bunches in a Free-Electron Laser. (arXiv:2201.05769v1 [physics.acc-ph])
    (0 min) Longitudinal properties of electron bunches are critical for the performance of a wide range of scientific facilities. In a free-electron laser, for example, the existing diagnostics only provide very limited longitudinal information of the electron bunch during online tuning and optimization. We leverage the power of artificial intelligence to build a neural network model using experimental data, in order to bring the destructive longitudinal phase space (LPS) diagnostics online virtually and improve the existing current profile online diagnostics which uses a coherent transition radiation (CTR) spectrometer. The model can also serve as a digital twin of the real machine on which algorithms can be tested efficiently and effectively. We demonstrate at the FLASH facility that the encoder-decoder model with more than one decoder can make highly accurate predictions of megapixel LPS images and coherent transition radiation spectra concurrently for electron bunches in a bunch train with broad ranges of LPS shapes and peak currents, which are obtained by scanning all the major control knobs for LPS manipulation. Furthermore, we propose a way to significantly improve the CTR spectrometer online measurement by combining the predicted and measured spectra. Our work showcases how to combine virtual and real diagnostics in order to provide heterogeneous and reliable mixed diagnostics for scientific facilities.
    Reliable Causal Discovery with Improved Exact Search and Weaker Assumptions. (arXiv:2201.05666v1 [cs.LG])
    (0 min) Many causal discovery methods rely on the faithfulness assumption to guarantee asymptotic correctness. However, the assumption can be approximately violated in many ways, leading to sub-optimal solutions. Although there is a line of research in Bayesian network structure learning that focuses on weakening the assumption, such as exact search methods with well-defined score functions, these do not scale well to large graphs. In this work, we introduce several strategies to improve the scalability of exact score-based methods in the linear Gaussian setting. In particular, we develop a super-structure estimation method based on the support of the inverse covariance matrix, which requires assumptions that are strictly weaker than faithfulness, and apply it to restrict the search space of exact search. We also propose a local search strategy that performs exact search on the local clusters formed by each variable and its neighbors within two hops in the super-structure. Numerical experiments validate the efficacy of the proposed procedure and demonstrate that it scales up to hundreds of nodes with high accuracy.
    Edge-based Tensor prediction via graph neural networks. (arXiv:2201.05770v1 [cond-mat.mtrl-sci])
    (0 min) Message-passing neural networks (MPNNs) have shown extremely high efficiency and accuracy in predicting the physical properties of molecules and crystals, and are expected to become the next-generation material simulation tool after density functional theory (DFT). However, there is currently a lack of a general MPNN framework for directly predicting the tensor properties of crystals. In this work, a general framework for the prediction of tensor properties is proposed: the tensor property of a crystal can be decomposed into the average of the tensor contributions of all the atoms in the crystal, and the tensor contribution of each atom can be expanded as the sum of the tensor projections in the directions of the edges connecting the atoms. On this basis, edge-based expansions of force vectors, Born effective charges (BECs), dielectric (DL) and piezoelectric (PZ) tensors are proposed. These expansions are rotationally equivariant, while the coefficients in these tensor expansions are rotationally invariant scalars, similar to physical quantities such as formation energy and band gap. The advantage of this tensor prediction framework is that it does not require the network itself to be equivariant. Therefore, in this work, we directly designed the edge-based tensor prediction graph neural network (ETGNN) model on the basis of an invariant graph neural network to predict tensors. The validity and high precision of this tensor prediction framework are demonstrated by tests of ETGNN on extended systems, randomly perturbed structures, and the JARVIS-DFT dataset. This tensor prediction framework is general to nearly all GNNs and can achieve higher accuracy with more advanced GNNs in the future.
    Recent Progress in the CUHK Dysarthric Speech Recognition System. (arXiv:2201.05845v1 [eess.AS])
    (0 min) Despite the rapid progress of automatic speech recognition (ASR) technologies in the past few decades, recognition of disordered speech remains a highly challenging task to date. Disordered speech presents a wide spectrum of challenges to current data-intensive deep neural network (DNN) based ASR technologies that predominantly target normal speech. This paper presents recent research efforts at the Chinese University of Hong Kong (CUHK) to improve the performance of disordered speech recognition systems on the largest publicly available UASpeech dysarthric speech corpus. A set of novel modelling techniques including neural architectural search, data augmentation using spectra-temporal perturbation, model based speaker adaptation and cross-domain generation of visual features within an audio-visual speech recognition (AVSR) system framework were employed to address the above challenges. The combination of these techniques produced the lowest published word error rate (WER) of 25.21% on the UASpeech test set of 16 dysarthric speakers, and an overall WER reduction of 5.4% absolute (17.6% relative) over the CUHK 2018 dysarthric speech recognition system featuring a 6-way DNN system combination and cross adaptation of out-of-domain normal speech data trained systems. Bayesian model adaptation further allows rapid adaptation to individual dysarthric speakers to be performed using as little as 3.06 seconds of speech. The efficacy of these techniques was further demonstrated on a CUDYS Cantonese dysarthric speech recognition task.
    Tight Regret Bounds for Noisy Optimization of a Brownian Motion. (arXiv:2001.09327v2 [cs.LG] UPDATED)
    (0 min) We consider the problem of Bayesian optimization of a one-dimensional Brownian motion in which the $T$ adaptively chosen observations are corrupted by Gaussian noise. We show that the smallest possible expected cumulative regret and the smallest possible expected simple regret scale as $\Omega(\sigma\sqrt{T / \log (T)}) \cap \mathcal{O}(\sigma\sqrt{T} \cdot \log T)$ and $\Omega(\sigma / \sqrt{T \log (T)}) \cap \mathcal{O}(\sigma\log T / \sqrt{T})$ respectively, where $\sigma^2$ is the noise variance. Thus, our upper and lower bounds are tight up to a factor of $\mathcal{O}( (\log T)^{1.5} )$. The upper bound uses an algorithm based on confidence bounds and the Markov property of Brownian motion (among other useful properties), and the lower bound is based on a reduction to binary hypothesis testing.
    Distribution-Free Federated Learning with Conformal Predictions. (arXiv:2110.07661v2 [cs.LG] UPDATED)
    (0 min) Federated learning has attracted considerable interest for collaborative machine learning in healthcare to leverage separate institutional datasets while maintaining patient privacy. However, additional challenges such as poor calibration and lack of interpretability may also hamper widespread deployment of federated models into clinical practice, leading to user distrust or misuse of ML tools in high-stakes clinical decision-making. In this paper, we propose to address these challenges by incorporating an adaptive conformal framework into federated learning to ensure distribution-free prediction sets that provide coverage guarantees. Importantly, these uncertainty estimates can be obtained without requiring any additional modifications to the model. Empirical results on the MedMNIST medical imaging benchmark demonstrate our federated method provides tighter coverage over local conformal predictions on 6 different medical imaging datasets for 2D and 3D multi-class classification tasks. Furthermore, we correlate class entropy with prediction set size to assess task uncertainty.
    GradTail: Learning Long-Tailed Data Using Gradient-based Sample Weighting. (arXiv:2201.05938v1 [cs.LG])
    (0 min) We propose GradTail, an algorithm that uses gradients to improve model performance on the fly in the face of long-tailed training data distributions. Unlike conventional long-tail classifiers which operate on converged - and possibly overfit - models, we demonstrate that an approach based on gradient dot product agreement can isolate long-tailed data early on during model training and improve performance by dynamically picking higher sample weights for that data. We show that such upweighting leads to model improvements for both classification and regression models, the latter of which are relatively unexplored in the long-tail literature, and that the long-tail examples found by gradient alignment are consistent with our semantic expectations.
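    A hedged sketch of the gradient-alignment idea: compare each example's gradient with the batch mean and upweight poorly aligned examples. The exponential weighting map below is an illustrative choice, not the paper's exact scheme.

```python
import numpy as np

def gradtail_weights(sample_grads, tau=1.0):
    """sample_grads: (n, d) per-example gradients of the loss.
    Examples whose gradients disagree with the average batch direction
    are treated as long-tail and receive larger sample weights."""
    mean = sample_grads.mean(axis=0)
    mean /= np.linalg.norm(mean) + 1e-12
    align = sample_grads @ mean / (np.linalg.norm(sample_grads, axis=1) + 1e-12)
    return np.exp(-align / tau)   # low or negative alignment -> weight > 1
```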
    Enhancement of Healthcare Data Performance Metrics using Neural Network Machine Learning Algorithms. (arXiv:2201.05962v1 [cs.LG])
    (0 min) Patients are often encouraged to make use of wearable devices for remote collection and monitoring of health data. This adoption of wearables results in a significant increase in the volume of data collected and transmitted, and battery life is quickly diminished by the devices' high processing requirements. Given the importance attached to medical data, it is imperative that all transmitted data adhere to strict integrity and availability requirements. Reducing the volume of healthcare data transmitted over the network may improve sensor battery life without compromising accuracy. There is a trade-off between efficiency and accuracy which can be controlled by adjusting the sampling and transmission rates. This paper demonstrates that machine learning can be used to analyse complex health data metrics, such as the accuracy and efficiency of data transmission, to overcome this trade-off. The study uses time series nonlinear autoregressive neural network algorithms to enhance both metrics by taking fewer samples to transmit. The algorithms were tested on a standard heart rate dataset to compare their accuracy and efficiency. The results showed that the Levenberg-Marquardt algorithm was the best performer, with an efficiency of 3.33 and an accuracy of 79.17%: its accuracy is similar to that of the other algorithms, while its efficiency is markedly better. This shows that machine learning can improve one metric without sacrificing the other, compared to existing methods.
    Big Data Application for Network Level Travel Time Prediction. (arXiv:2201.05760v1 [cs.LG])
    (0 min) Travel time is essential in advanced traveler information systems (ATIS). This paper uses the big data analytics engines Apache Spark and Apache MXNet for data processing and modeling, and the efficiency gain is evaluated by comparison with popular data science and deep learning frameworks. Hierarchical feature pooling is explored both between LSTM (Long Short-Term Memory) layers and at the output layer. The designed hierarchical LSTM (hiLSTM) model can consider dependencies at different time scales to capture the spatial-temporal correlations of network-level corridor travel time. A self-attention module then connects temporal and spatial features to the fully connected layers, predicting travel time for all corridors instead of a single link/route. Seasonality and autocorrelation analyses were performed to explore trends in the time-varying data. The case study shows that the Hierarchical LSTM with Attention (hiLSTMat) model gives the best result and outperforms baseline models. The California Bay Area corridor travel time dataset, covering a four-year period, was obtained from the Caltrans Performance Measurement System (PeMS).
    Toward Among-Device AI from On-Device AI with Stream Pipelines. (arXiv:2201.06026v1 [cs.LG])
    (0 min) Modern consumer electronic devices often provide intelligence services with deep neural networks. We have started migrating the computing locations of intelligence services from cloud servers (traditional AI systems) to the corresponding devices (on-device AI systems). On-device AI systems generally have the advantages of preserving privacy, removing network latency, and saving cloud costs. With the emergence of on-device AI systems having relatively low computing power, the inconsistent and varying hardware resources and capabilities pose difficulties. The authors' affiliation has started applying a stream pipeline framework, NNStreamer, to on-device AI systems, saving development costs and hardware resources and improving performance. We want to expand the types of devices and applications with on-device AI service products of both the affiliation and second/third parties. We also want to make each AI service atomic, re-deployable, and shared among connected devices of arbitrary vendors; as always, this introduces yet another requirement. The new requirement, "among-device AI", includes connectivity between AI pipelines so that they may share computing resources and hardware capabilities across a wide range of devices regardless of vendor or manufacturer. We propose extensions of the stream pipeline framework, NNStreamer, for on-device AI so that NNStreamer may provide among-device AI capability. This work is a Linux Foundation (LF AI and Data) open source project accepting contributions from the general public.
    IBAC: An Intelligent Dynamic Bandwidth Channel Access Avoiding Outside Warning Range Problem. (arXiv:2201.05727v1 [cs.NI])
    (0 min) IEEE 802.11ax uses the concept of primary and secondary channels, leading to the Dynamic Bandwidth Channel Access (DBCA) mechanism. By applying DBCA, a wireless station can select a wider channel bandwidth, such as 40/80/160 MHz, through the channel bonding feature. However, during channel bonding, inappropriate bandwidth selection can cause collisions. Therefore, to avoid collisions, a well-developed medium access control (MAC) protocol is crucial to effectively utilize the channel bonding mechanism. In this paper, we address a collision scenario, called the Outside Warning Range Problem (OWRP), that may occur during DBCA when a wireless station interferes with another wireless station after channel bonding is performed. We therefore propose a MAC-layer mechanism, Intelligent Bonding Avoiding Collision (IBAC), that adapts the channel bonding level in DBCA to avoid the OWRP. We first design a theoretical model based on Markov chains for DBCA while avoiding the OWRP. Based on this model, we design a Thompson sampling based Bayesian approach to intelligently select the best possible channel bonding level. We analyze the performance of IBAC through simulations, where it is observed that, compared to other competing mechanisms, the proposed approach can enhance network performance significantly while avoiding the OWRP.
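    The Thompson-sampling core is compact; a sketch that keeps one Beta posterior per bonding level and treats "transmission without collision" as a Bernoulli reward. This reward model is my simplification, not the paper's Markov-chain formulation.

```python
import random

class ThompsonBonding:
    """One Beta(successes+1, failures+1) posterior per channel bonding
    level (MHz); sample from each posterior and transmit at the argmax."""
    def __init__(self, levels=(20, 40, 80, 160)):
        self.stats = {lv: [1, 1] for lv in levels}   # [alpha, beta] priors

    def select(self):
        draws = {lv: random.betavariate(a, b)
                 for lv, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, level, success):
        # success=True: no collision observed at this bonding level
        self.stats[level][0 if success else 1] += 1
```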
    Unbiased deep solvers for linear parametric PDEs. (arXiv:1810.05094v4 [q-fin.CP] UPDATED)
    (0 min) We develop several deep learning algorithms for approximating families of parametric PDE solutions. The proposed algorithms approximate solutions together with their gradients, which in the context of mathematical finance means that derivative prices and hedging strategies are computed simultaneously. Having approximated the gradient of the solution, one can combine it with a Monte Carlo simulation to remove the bias in the deep network approximation of the PDE solution (the derivative price). This is achieved by leveraging the Martingale Representation Theorem and combining the Monte Carlo simulation with the neural network. The resulting algorithm is robust with respect to the quality of the neural network approximation and can consequently be used as a black box when only limited a priori information about the underlying problem is available. We believe this is important, as neural network based algorithms often require a fair amount of tuning to produce satisfactory results. The methods are empirically shown to work for high-dimensional problems (e.g., 100 dimensions). We provide diagnostics that shed light on appropriate network architectures.
    Neural-Symbolic Integration for Interactive Learning and Conceptual Grounding. (arXiv:2112.11805v2 [cs.AI] UPDATED)
    (0 min) We propose neural-symbolic integration for abstract concept explanation and interactive learning. Neural-symbolic integration and explanation allow users and domain-experts to learn about the data-driven decision making process of large neural models. The models are queried using a symbolic logic language. Interaction with the user then confirms or rejects a revision of the neural model using logic-based constraints that can be distilled into the model architecture. The approach is illustrated using the Logic Tensor Network framework alongside Concept Activation Vectors and applied to a Convolutional Neural Network.
    Theoretical analysis and computation of the sample Frechet mean for sets of large graphs based on spectral information. (arXiv:2201.05923v1 [stat.ML])
    (0 min) To characterize the location (mean, median) of a set of graphs, one needs a notion of centrality that is adapted to metric spaces, since graph sets are not Euclidean spaces. A standard approach is to consider the Frechet mean. In this work, we equip a set of graphs with the pseudometric defined by the norm between the eigenvalues of their respective adjacency matrix. Unlike the edit distance, this pseudometric reveals structural changes at multiple scales, and is well adapted to studying various statistical problems for graph-valued data. We describe an algorithm to compute an approximation to the sample Frechet mean of a set of undirected unweighted graphs with a fixed size using this pseudometric.
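    The pseudometric itself is one line for undirected graphs; a sketch computing it, plus a naive medoid-style stand-in for the sample Frechet mean. The medoid shortcut is illustrative only, not the paper's approximation algorithm.

```python
import numpy as np

def spectral_dist(A, B):
    """Pseudometric: Euclidean norm between the (sorted) spectra of
    two symmetric adjacency matrices of the same size."""
    return np.linalg.norm(np.linalg.eigvalsh(A) - np.linalg.eigvalsh(B))

def frechet_medoid(adjs):
    """Pick the sample graph minimizing the sum of squared spectral
    distances to the rest: a crude in-sample proxy for the Frechet mean."""
    cost = [sum(spectral_dist(A, B) ** 2 for B in adjs) for A in adjs]
    return int(np.argmin(cost))
```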
    Wrapped Classifier with Dummy Teacher for training physics-based classifier at unlabeled radar data. (arXiv:2201.05735v1 [physics.geo-ph])
    (0 min) This paper describes a method for the automatic classification of signals received during 2021 by the EKB and MAGW ISTP SB RAS coherent scatter radars (8-20 MHz operating frequency). The method is suitable for automatic physical interpretation of the resulting classification of experimental data in real time. We call this algorithm the Wrapped Classifier with Dummy Teacher. The method is trained on an unlabeled dataset and is based on learning an optimal physics-based classification from clustering results. The approach is close to optimal embedding search, where the embedding is interpreted as a vector of probabilities for soft classification, and it allows finding an optimal classification algorithm based on physically interpretable parameters of the received data, both obtained from physics-based numerical simulation and measured experimentally. The Dummy Teacher clusterer used for labeling the unlabeled dataset is a Gaussian mixture clustering algorithm. To make the algorithm work, we extended the parameters measured by the radar with additional parameters calculated by simulating radio wave propagation using ray tracing with the IRI-2012 and IGRF models for the ionosphere and the Earth's magnetic field, respectively. For clustering with the Dummy Teacher we use the whole set of available parameters (measured and simulated), while for classification with the Wrapped Classifier we use only the well physically interpretable ones. As a result, we trained the classification network and found 11 classes in the available data that are well interpretable from a physical point of view. The five other classes found are not physically interpretable, demonstrating the importance of taking radio wave propagation into account for correct classification.
    The Dark Side of the Language: Pre-trained Transformers in the DarkNet. (arXiv:2201.05613v1 [cs.CL])
    (0 min) Pre-trained Transformers are challenging human performance in many natural language processing tasks. The gigantic datasets used for pre-training seem to be the key to their success on existing tasks. In this paper, we explore how a range of pre-trained natural language understanding models perform on truly novel and unexplored data, provided by classification tasks over a DarkNet corpus. Surprisingly, results show that syntactic and lexical neural networks largely outperform pre-trained Transformers. This seems to suggest that pre-trained Transformers have serious difficulties adapting to radically novel texts.
    Concise Logarithmic Loss Function for Robust Training of Anomaly Detection Model. (arXiv:2201.05748v1 [cs.LG])
    (0 min) Recently, deep learning-based algorithms have been widely adopted because they can establish anomaly detection models with little or no domain knowledge of the task. To train such networks more stably, however, it helps to choose an appropriate network structure or loss function. The mean squared error (MSE) function is widely adopted for training anomaly detection models. In this paper, we instead propose a novel loss function, the logarithmic mean squared error (LMSE), to train neural networks more stably. This study covers a range of comparisons, from mathematical analysis and visualization in the differential domain for backpropagation to loss convergence during training and anomaly detection performance. Overall, LMSE is superior to the existing MSE function in terms of the strength of loss convergence and anomaly detection performance. The LMSE function is expected to be applicable to training not only anomaly detection models but also general generative neural networks.
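    The abstract does not give the exact form of LMSE, so the following PyTorch sketch uses log(1 + MSE) as one plausible, loudly assumed variant; it only illustrates how a logarithm compresses large errors in the gradient domain:

        import torch

        def lmse_loss(pred, target):
            # Assumed form: log(1 + MSE); see the paper for the exact definition.
            mse = torch.mean((pred - target) ** 2)
            return torch.log1p(mse)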
    Domain Adaptation via Bidirectional Cross-Attention Transformer. (arXiv:2201.05887v1 [cs.CV])
    (0 min) Domain Adaptation (DA) aims to transfer the knowledge learned from a source domain with ample labeled data to a target domain with only unlabeled data. Most existing studies on DA learn domain-invariant feature representations for both domains by minimizing the domain gap with convolution-based neural networks. Recently, vision transformers have significantly improved performance in multiple vision tasks. Building on vision transformers, in this paper we propose a Bidirectional Cross-Attention Transformer (BCAT) for DA, with the aim of improving performance. In the proposed BCAT, the attention mechanism extracts implicit source and target mix-up feature representations to narrow the domain discrepancy. Specifically, in BCAT we design a weight-sharing quadruple-branch transformer with a bidirectional cross-attention mechanism to learn domain-invariant feature representations. Extensive experiments demonstrate that the proposed BCAT model achieves superior performance on four benchmark datasets over existing state-of-the-art DA methods based on convolutions or transformers.
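    A minimal PyTorch sketch of the bidirectional cross-attention ingredient (the full BCAT is a weight-sharing quadruple-branch transformer; tensor shapes here are illustrative assumptions):

        import torch
        import torch.nn as nn

        attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
        src = torch.randn(4, 197, 256)  # source-domain tokens
        tgt = torch.randn(4, 197, 256)  # target-domain tokens

        # The same (weight-shared) attention module is applied in both directions.
        src2tgt, _ = attn(query=src, key=tgt, value=tgt)  # source attends to target
        tgt2src, _ = attn(query=tgt, key=src, value=src)  # target attends to source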
    Tractable structured natural gradient descent using local parameterizations. (arXiv:2102.07405v10 [stat.ML] UPDATED)
    (0 min) Natural-gradient descent (NGD) on structured parameter spaces (e.g., low-rank covariances) is computationally challenging due to difficult Fisher-matrix computations. We address this issue by using \emph{local-parameter coordinates} to obtain a flexible and efficient NGD method that works well for a wide-variety of structured parameterizations. We show four applications where our method (1) generalizes the exponential natural evolutionary strategy, (2) recovers existing Newton-like algorithms, (3) yields new structured second-order algorithms via matrix groups, and (4) gives new algorithms to learn covariances of Gaussian and Wishart-based distributions. We show results on a range of problems from deep learning, variational inference, and evolution strategies. Our work opens a new direction for scalable structured geometric methods.
    Interpretable and Effective Reinforcement Learning for Attacking against Graph-based Rumor Detection. (arXiv:2201.05819v1 [cs.LG])
    (0 min) Social networks are polluted by rumors, which can be detected by machine learning models. However, these models are fragile, and understanding their vulnerabilities is critical to rumor detection. Certain vulnerabilities stem from dependencies on the graphs and on suspiciousness ranking, and are difficult for end-to-end methods to learn from limited noisy data. With a black-box detector, we design features capturing these dependencies to allow a reinforcement learning agent to learn an effective and interpretable attack policy based on the detector's output. To speed up learning, we devise: (i) a credit assignment method that decomposes delayed rewards to individual attacking steps in proportion to their effects; (ii) a time-dependent control variate to reduce variance due to large graphs and many attacking steps. On two social rumor datasets, we demonstrate: (i) the effectiveness of the attacks compared to rule-based attacks and end-to-end approaches; (ii) the usefulness of the proposed credit assignment strategy and control variate; (iii) the interpretability of the policy when generating strong attacks.
    Logarithmic Regret in Multisecretary and Online Linear Programming Problems with Continuous Valuations. (arXiv:1912.08917v3 [cs.LG] UPDATED)
    (0 min) I study a general revenue management problem in which $ n $ customers arrive sequentially over $ n $ periods, and you must dynamically decide which to satisfy. Satisfying the period-$ t $ customer yields utility $ u_{t} \in \mathbb{R}_{+} $ and decreases your inventory holdings by $ A_{t} \in \mathbb{R}_{+}^{M} $. The customer vectors, $ (u_{t}, A_{t}')' $, are i.i.d., with $ u_{t} $ drawn from a finite-mean continuous distribution and $ A_{t} $ drawn from a bounded discrete or continuous distribution. I study this system's regret, which is the additional utility you could get if you didn't have to make decisions on the fly. I show that if your initial inventory endowment scales linearly with $ n $, then your expected regret is $ \Theta(\log(n)) $ as $ n \rightarrow \infty $, and I provide a simple policy that achieves this $ \Theta(\log(n)) $ regret rate. Finally, I extend this result to Arlotto and Gurvich's (2019) multisecretary problem with uniformly distributed secretary valuations.
  • stat.ML updates on arXiv.org

    Graph Transfer Learning via Adversarial Domain Adaptation with Graph Convolution. (arXiv:1909.01541v2 [cs.LG] UPDATED)
    (2 min) This paper studies the problem of cross-network node classification, to overcome the insufficiency of labeled data in a single network. It aims to leverage the label information in a partially labeled source network to assist node classification in a completely unlabeled or partially labeled target network. Existing methods for single-network learning cannot solve this problem due to the domain shift across networks. Some multi-network learning methods heavily rely on the existence of cross-network connections and are thus inapplicable to this problem. To tackle it, we propose a novel graph transfer learning framework, AdaGCN, that leverages the techniques of adversarial domain adaptation and graph convolution. It consists of two components: a semi-supervised learning component and an adversarial domain adaptation component. The former aims to learn class-discriminative node representations with the given label information of the source and target networks, while the latter contributes to mitigating the distribution divergence between the source and target domains to facilitate knowledge transfer. Extensive empirical evaluations on real-world datasets show that AdaGCN can successfully transfer class information even with a low label rate on the source network and a substantial divergence between the source and target domains.
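    The adversarial component can be realized with a gradient-reversal layer, a standard device in adversarial domain adaptation; this PyTorch sketch shows the device itself, not AdaGCN's exact architecture:

        import torch

        class GradReverse(torch.autograd.Function):
            @staticmethod
            def forward(ctx, x, lam):
                ctx.lam = lam
                return x.view_as(x)  # identity in the forward pass

            @staticmethod
            def backward(ctx, grad_output):
                # Flip gradients so the feature extractor fools the domain discriminator.
                return -ctx.lam * grad_output, None

        def grad_reverse(x, lam=1.0):
            return GradReverse.apply(x, lam)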
    Differentially Private (Gradient) Expectation Maximization Algorithm with Statistical Guarantees. (arXiv:2010.13520v3 [cs.LG] UPDATED)
    (2 min) (Gradient) Expectation Maximization (EM) is a widely used algorithm for estimating the maximum likelihood of mixture models or incomplete data problems. A major challenge facing this popular technique is how to effectively preserve the privacy of sensitive data. Previous research on this problem has already led to the discovery of some Differentially Private (DP) algorithms for (Gradient) EM. However, unlike in the non-private case, existing techniques are not yet able to provide finite-sample statistical guarantees. To address this issue, we propose in this paper the first DP version of the (Gradient) EM algorithm with statistical guarantees. Moreover, we apply our general framework to three canonical models: Gaussian Mixture Model (GMM), Mixture of Regressions Model (MRM), and Linear Regression with Missing Covariates (RMC). Specifically, for GMM in the DP model, our estimation error is near optimal in some cases. For the other two models, we provide the first finite-sample statistical guarantees. Our theory is supported by thorough numerical experiments.
    Neighborhood Region Smoothing Regularization for Finding Flat Minima In Deep Neural Networks. (arXiv:2201.06064v1 [cs.LG])
    (2 min) Due to the diverse architectures of deep neural networks (DNNs) and their severe overparameterization, regularization techniques are critical for finding optimal solutions in the huge hypothesis space. In this paper, we propose an effective regularization technique called Neighborhood Region Smoothing (NRS). NRS leverages the finding that models benefit from converging to flat minima, and regularizes the neighborhood region in weight space to yield approximately equal outputs. Specifically, the gap between the outputs of models in the neighborhood region is gauged by a metric based on the Kullback-Leibler divergence. This metric provides insights similar to the minimum description length principle for interpreting flat minima. By minimizing both this divergence and the empirical loss, NRS explicitly drives the optimizer towards flat minima. We confirm the effectiveness of NRS in image classification tasks across a wide range of model architectures on commonly used datasets such as CIFAR and ImageNet, where it universally improves generalization. We also show empirically that the minima found by NRS have relatively smaller Hessian eigenvalues than those found by conventional training, which is considered evidence of flat minima.
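    A hedged sketch of an NRS-style penalty: perturb the weights within a small neighborhood and penalize the KL divergence between the outputs of the original and perturbed models. The radius and the single perturbation sample are illustrative assumptions:

        import copy
        import torch
        import torch.nn.functional as F

        def nrs_penalty(model, x, radius=1e-2):
            noisy = copy.deepcopy(model)
            with torch.no_grad():
                for p in noisy.parameters():
                    p.add_(radius * torch.randn_like(p))  # neighbor in weight space
            p_log = F.log_softmax(model(x), dim=-1)
            q = F.softmax(noisy(x), dim=-1)
            # KL gap between the model and its neighbor; add this to the empirical loss.
            return F.kl_div(p_log, q, reduction="batchmean")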
    Extended Stochastic Block Models with Application to Criminal Networks. (arXiv:2007.08569v3 [stat.ME] UPDATED)
    (0 min) Reliably learning group structures among nodes in network data is challenging in several applications. We are particularly motivated by studying covert networks that encode relationships among criminals. These data are subject to measurement errors, and exhibit a complex combination of an unknown number of core-periphery, assortative and disassortative structures that may unveil key architectures of the criminal organization. The coexistence of these noisy block patterns limits the reliability of routinely-used community detection algorithms, and requires extensions of model-based solutions to realistically characterize the node partition process, incorporate information from node attributes, and provide improved strategies for estimation and uncertainty quantification. To cover these gaps, we develop a new class of extended stochastic block models (ESBM) that infer groups of nodes having common connectivity patterns via Gibbs-type priors on the partition process. This choice encompasses many realistic priors for criminal networks, covering solutions with fixed, random and infinite number of possible groups, and facilitates the inclusion of node attributes in a principled manner. Among the new alternatives in our class, we focus on the Gnedin process as a realistic prior that allows the number of groups to be finite, random and subject to a reinforcement process coherent with criminal networks. A collapsed Gibbs sampler is proposed for the whole ESBM class, and refined strategies for estimation, prediction, uncertainty quantification and model selection are outlined. The ESBM performance is illustrated in realistic simulations and in an application to an Italian mafia network, where we unveil key complex block structures, mostly hidden from state-of-the-art alternatives.
    Deep learning via LSTM models for COVID-19 infection forecasting in India. (arXiv:2101.11881v2 [cs.LG] UPDATED)
    (0 min) The COVID-19 pandemic continues to have a major impact on health and medical infrastructure, the economy, and agriculture. Prominent computational and mathematical models have been unreliable due to the complexity of the spread of infections, and a lack of data collection and reporting makes modelling attempts difficult and unreliable. Hence, we need to re-examine the situation with reliable data sources and innovative forecasting models. Deep learning models such as recurrent neural networks are well suited for modelling spatiotemporal sequences. In this paper, we apply recurrent neural networks such as long short-term memory (LSTM), bidirectional LSTM, and encoder-decoder LSTM models to multi-step (short-term) COVID-19 infection forecasting. We select Indian states with COVID-19 hotspots, capture the first (2020) and second (2021) waves of infections, and provide a two-month-ahead forecast. Our model predicts that the likelihood of another wave of infections in October and November 2021 is low; however, the authorities need to be vigilant given emerging variants of the virus. The accuracy of the predictions motivates the application of the method in other countries and regions. Nevertheless, challenges remain due to the reliability of data and the difficulty of capturing factors such as population density, logistics, and social aspects such as culture and lifestyle.
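    For concreteness, a minimal encoder-decoder LSTM for multi-step forecasting in the spirit of the models above; layer sizes and the 60-step horizon are illustrative assumptions, not the paper's configuration:

        import torch
        import torch.nn as nn

        class Seq2SeqLSTM(nn.Module):
            def __init__(self, n_features=1, hidden=64, horizon=60):
                super().__init__()
                self.horizon = horizon
                self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
                self.decoder = nn.LSTM(n_features, hidden, batch_first=True)
                self.head = nn.Linear(hidden, n_features)

            def forward(self, x):              # x: (batch, window, n_features)
                _, state = self.encoder(x)     # summarize the input window
                step = x[:, -1:, :]            # seed decoding with the last observation
                outputs = []
                for _ in range(self.horizon):  # autoregressive multi-step decoding
                    out, state = self.decoder(step, state)
                    step = self.head(out)
                    outputs.append(step)
                return torch.cat(outputs, dim=1)  # (batch, horizon, n_features)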
    Enhanced Membership Inference Attacks against Machine Learning Models. (arXiv:2111.09679v2 [cs.LG] UPDATED)
    (0 min) How much does a given trained model leak about each individual data record in its training set? Membership inference attacks are used as an auditing tool to quantify the private information that a model leaks about the individual data points in its training set. Membership inference attacks are influenced by different uncertainties that an attacker has to resolve about the training data, the training algorithm, and the underlying data distribution. Thus the attack success rates of many attacks in the literature do not precisely capture the information leakage of models about their data, as they also reflect other uncertainties that the attack algorithm has. In this paper, we explain the implicit assumptions and simplifications made in prior work using the framework of hypothesis testing. We also derive new attack algorithms from this framework that can achieve a high AUC score while also highlighting the different factors that affect their performance. Our algorithms capture a very precise approximation of the privacy loss in models, and can be used as a tool to perform an accurate and informed estimation of privacy risk in machine learning models. We provide a thorough empirical evaluation of our attack strategies on various machine learning tasks and benchmark datasets.
    Fair Interpretable Learning via Correction Vectors. (arXiv:2201.06343v1 [cs.LG])
    (0 min) Neural network architectures have been extensively employed in the fair representation learning setting, where the objective is to learn a new representation of a given vector that is independent of sensitive information. Various "representation debiasing" techniques have been proposed in the literature. However, as neural networks are inherently opaque, these methods are hard to comprehend, which limits their usefulness. We propose a new framework for fair representation learning centered around the learning of "correction vectors", which have the same dimensionality as the given data vectors. The corrections are simply added to the original features, and can therefore be analyzed as an explicit penalty or bonus for each feature. We show experimentally that a fair representation learning problem constrained in this way does not degrade performance.
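    A minimal sketch of the correction-vector idea: learn a vector with the same dimensionality as the input and add it to the original features, so each coordinate reads as an explicit per-feature penalty or bonus. The two-layer network is an illustrative assumption:

        import torch
        import torch.nn as nn

        class CorrectionVectorModel(nn.Module):
            def __init__(self, dim):
                super().__init__()
                self.correction = nn.Sequential(
                    nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
                )

            def forward(self, x):
                w = self.correction(x)  # interpretable per-feature correction
                return x + w, w         # debiased representation, plus the correction itself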
    Physical Derivatives: Computing policy gradients by physical forward-propagation. (arXiv:2201.05830v1 [cs.RO])
    (0 min) Model-free and model-based reinforcement learning are two ends of a spectrum. Learning a good policy without a dynamic model can be prohibitively expensive. Learning the dynamic model of a system can reduce the cost of learning the policy, but it can also introduce bias if it is not accurate. We propose a middle ground where instead of the transition model, the sensitivity of the trajectories with respect to the perturbation of the parameters is learned. This allows us to predict the local behavior of the physical system around a set of nominal policies without knowing the actual model. We assay our method on a custom-built physical robot in extensive experiments and show the feasibility of the approach in practice. We investigate potential challenges when applying our method to physical systems and propose solutions to each of them.
    Socioeconomic disparities and COVID-19: the causal connections. (arXiv:2201.07026v1 [cs.LG])
    (0 min) The analysis of causation is a challenging task that can be approached in various ways. With the increasing use of machine learning based models in computational socioeconomics, explaining these models while taking causal connections into account is a necessity. In this work, we advocate the use of an explanatory framework from cooperative game theory augmented with $do$-calculus, namely causal Shapley values. Using causal Shapley values, we analyze socioeconomic disparities that have a causal link to the spread of COVID-19 in the USA. We study several phases of the disease spread to show how the causal connections change over time. We perform a causal analysis using random effects models and discuss the correspondence between the two methods to verify our results. We show the distinct advantages non-linear machine learning models have over linear models when performing a multivariate analysis, especially since the machine learning models can map out non-linear correlations in the data. In addition, the causal Shapley values allow for including the causal structure in the variable importance computed for the machine learning model.
    On the Number of Edges of the Frechet Mean and Median Graphs. (arXiv:2105.14397v4 [math.CO] UPDATED)
    (0 min) The availability of large datasets composed of graphs creates an unprecedented need to invent novel tools in statistical learning for graph-valued random variables. To characterize the average of a sample of graphs, one can compute the sample Frechet mean and median graphs. In this paper, we address the following foundational question: does a mean or median graph inherit the structural properties of the graphs in the sample? An important graph property is the edge density; we establish that edge density is a hereditary property, which can be transmitted from a graph sample to its sample Frechet mean or median graphs, irrespective of the method used to estimate the mean or the median. Because of the prominence of the Frechet mean in graph-valued machine learning, this novel theoretical result has significant practical consequences.
    A Short Tutorial on The Weisfeiler-Lehman Test And Its Variants. (arXiv:2201.07083v1 [stat.ML])
    (0 min) Graph neural networks are designed to learn functions on graphs. Typically, the relevant target functions are invariant with respect to actions by permutations. Therefore the design of some graph neural network architectures has been inspired by graph-isomorphism algorithms. The classical Weisfeiler-Lehman algorithm (WL) -- a graph-isomorphism test based on color refinement -- became relevant to the study of graph neural networks. The WL test can be generalized to a hierarchy of higher-order tests, known as $k$-WL. This hierarchy has been used to characterize the expressive power of graph neural networks, and to inspire the design of graph neural network architectures. A few variants of the WL hierarchy appear in the literature. The goal of this short note is pedagogical and practical: We explain the differences between the WL and folklore-WL formulations, with pointers to existing discussions in the literature. We illuminate the differences between the formulations by visualizing an example.
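    For readers who want the mechanics, one round of WL color refinement fits in a few lines of Python (using networkx; the hash-based relabeling is the standard textbook formulation):

        import networkx as nx

        def wl_refine(G, colors, rounds=3):
            for _ in range(rounds):
                # New color = hash of own color plus the multiset of neighbor colors.
                colors = {
                    v: hash((colors[v], tuple(sorted(colors[u] for u in G[v]))))
                    for v in G
                }
            return colors

        G = nx.cycle_graph(6)
        colors = wl_refine(G, {v: 0 for v in G})
        print(len(set(colors.values())))  # 1: WL cannot distinguish nodes of a cycle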
    Treatment Effect Risk: Bounds and Inference. (arXiv:2201.05893v1 [stat.ME])
    (0 min) While the average treatment effect (ATE) measures the change in social welfare, even a positive ATE can mask a risk of negative effects on, say, some 10% of the population. Assessing such risk is difficult, however, because any one individual treatment effect (ITE) is never observed, so the 10% worst-affected cannot be identified, while distributional treatment effects only compare the first deciles within each treatment group, which does not correspond to any 10%-subpopulation. In this paper we consider how to nonetheless assess this important risk measure, formalized as the conditional value at risk (CVaR) of the ITE distribution. We leverage the availability of pre-treatment covariates and characterize the tightest-possible upper and lower bounds on the ITE-CVaR given by the covariate-conditional average treatment effect (CATE) function. Some bounds can also be interpreted as summarizing a complex CATE function into a single metric and are of interest independently of being bounds. We then study how to estimate these bounds efficiently from data and construct confidence intervals. This is challenging even in randomized experiments, as it requires understanding the distribution of the unknown CATE function, which can be very complex if we use rich covariates to best control for heterogeneity. We develop a debiasing method that overcomes this and prove that it enjoys favorable statistical properties even when CATE and other nuisances are estimated by black-box machine learning or even inconsistently. Studying a hypothetical change to French job-search counseling services, our bounds and inference demonstrate that a small social benefit entails a negative impact on a substantial subpopulation.
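    A small numeric illustration of the risk measure in question: the CVaR at level 10% of an ITE distribution is the mean effect over the worst-affected 10%. The Gaussian draw below is purely hypothetical:

        import numpy as np

        rng = np.random.default_rng(0)
        ite = rng.normal(0.5, 1.0, 100_000)  # hypothetical ITE draws; ATE is about +0.5
        alpha = 0.10
        q = np.quantile(ite, alpha)          # 10% quantile (value at risk)
        cvar = ite[ite <= q].mean()          # mean over the worst-affected 10%
        print(q, cvar)                       # CVaR is negative despite the positive ATE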
    Robust Learning-based Predictive Control for Discrete-time Nonlinear Systems with Unknown Dynamics and State Constraints. (arXiv:1911.09827v4 [eess.SY] UPDATED)
    (0 min) Robust model predictive control (MPC) is a well-known control technique for model-based control with constraints and uncertainties. In classic robust tube-based MPC approaches, an open-loop control sequence is computed via periodically solving an online nominal MPC problem, which requires prior model information and frequent access to onboard computational resources. In this paper, we propose an efficient robust MPC solution based on receding horizon reinforcement learning, called r-LPC, for unknown nonlinear systems with state constraints and disturbances. The proposed r-LPC utilizes a Koopman operator-based prediction model obtained off-line from pre-collected input-output datasets. Unlike classic tube-based MPC, in each prediction time interval of r-LPC, we use an actor-critic structure to learn a near-optimal feedback control policy rather than a control sequence. The resulting closed-loop control policy can be learned off-line and deployed online or learned online in an asynchronous way. In the latter case, online learning can be activated whenever necessary; for instance, the safety constraint is violated with the deployed policy. The closed-loop recursive feasibility, robustness, and asymptotic stability are proven under function approximation errors of the actor-critic networks. Simulation and experimental results on two nonlinear systems with unknown dynamics and disturbances have demonstrated that our approach has better or comparable performance when compared with tube-based MPC and LQR, and outperforms a recently developed actor-critic learning approach.
    Convergence of a robust deep FBSDE method for stochastic control. (arXiv:2201.06854v1 [math.OC])
    (0 min) In this paper we propose a deep learning-based numerical scheme for strongly coupled FBSDEs stemming from stochastic control. It is a modification of the deep BSDE method in which the initial value of the backward equation is not a free parameter, and with a new loss function given by the weighted sum of the cost of the control problem and a variance term which coincides with the mean square error in the terminal condition. We show by a numerical example that a direct extension of the classical deep BSDE method to FBSDEs fails for a simple linear-quadratic control problem, and motivate why the new method works. Under regularity and boundedness assumptions on the exact controls of the time-continuous and time-discrete control problems, we provide an error analysis for our method. We show empirically that the method converges for three different problems, one being the one that fails for a direct extension of the deep BSDE method.
    Targeted Optimal Treatment Regime Learning Using Summary Statistics. (arXiv:2201.06229v1 [stat.ME])
    (0 min) Personalized decision-making, aiming to derive optimal individualized treatment rules (ITRs) based on individual characteristics, has recently attracted increasing attention in many fields, such as medicine, social services, and economics. Current literature mainly focuses on estimating ITRs from a single source population. In real-world applications, the distribution of a target population can be different from that of the source population. Therefore, ITRs learned by existing methods may not generalize well to the target population. Due to privacy concerns and other practical issues, individual-level data from the target population is often not available, which makes ITR learning more challenging. We consider an ITR estimation problem where the source and target populations may be heterogeneous, individual data is available from the source population, and only the summary information of covariates, such as moments, is accessible from the target population. We develop a weighting framework that tailors an ITR for a given target population by leveraging the available summary statistics. Specifically, we propose a calibrated augmented inverse probability weighted estimator of the value function for the target population and estimate an optimal ITR by maximizing this estimator within a class of pre-specified ITRs. We show that the proposed calibrated estimator is consistent and asymptotically normal even with flexible semi/nonparametric models for nuisance function approximation, and the variance of the value estimator can be consistently estimated. We demonstrate the empirical performance of the proposed method using simulation studies and a real application to an eICU dataset as the source sample and a MIMIC-III dataset as the target sample.
    Online, Informative MCMC Thinning with Kernelized Stein Discrepancy. (arXiv:2201.07130v1 [cs.LG])
    (0 min) A fundamental challenge in Bayesian inference is efficient representation of a target distribution. Many non-parametric approaches do so by sampling a large number of points using variants of Markov Chain Monte Carlo (MCMC). We propose an MCMC variant that retains only those posterior samples which exceed a KSD threshold, which we call KSD Thinning. We establish the convergence and complexity tradeoffs for several settings of KSD Thinning as a function of the KSD threshold parameter, sample size, and other problem parameters. Finally, we provide experimental comparisons against other online nonparametric Bayesian methods that generate low-complexity posterior representations, and observe superior consistency/complexity tradeoffs. Code is available at github.com/colehawkins/KSD-Thinning.
    Particle Dual Averaging: Optimization of Mean Field Neural Networks with Global Convergence Rate Analysis. (arXiv:2012.15477v3 [stat.ML] UPDATED)
    (0 min) We propose the particle dual averaging (PDA) method, which generalizes the dual averaging method in convex optimization to optimization over probability distributions with a quantitative runtime guarantee. The algorithm consists of an inner loop and an outer loop: the inner loop utilizes the Langevin algorithm to approximately solve for a stationary distribution, which is then optimized in the outer loop. The method can thus be interpreted as an extension of the Langevin algorithm that naturally handles nonlinear functionals on the probability space. An important application of the proposed method is the optimization of neural networks in the mean field regime, which is theoretically attractive due to the presence of nonlinear feature learning, but for which a quantitative convergence rate can be challenging to obtain. By adapting finite-dimensional convex optimization theory to the space of measures, we analyze PDA in regularized empirical / expected risk minimization, and establish quantitative global convergence in learning two-layer mean field neural networks under more general settings. Our theoretical results are supported by numerical simulations on neural networks of reasonable size.
    Self-supervised Representation Learning of Neuronal Morphologies. (arXiv:2112.12482v2 [stat.ML] UPDATED)
    (0 min) Understanding the diversity of cell types and their function in the brain is one of the key challenges in neuroscience. The advent of large-scale datasets has given rise to the need of unbiased and quantitative approaches to cell type classification. We present GraphDINO, a purely data-driven approach to learning a low dimensional representation of the 3D morphology of neurons. GraphDINO is a novel graph representation learning method for spatial graphs utilizing self-supervised learning on transformer models. It combines attention-based global interaction between nodes and classic graph convolutional processing. We show, in two different species and cortical areas, that this method is able to yield morphological cell type clustering that is comparable to manual feature-based classification and shows a good correspondence to expert-labeled cell types. Our method is applicable beyond neuroscience in settings where samples in a dataset are graphs and graph-level embeddings are desired.
    Neuron-Specific Dropout: A Deterministic Regularization Technique to Prevent Neural Networks from Overfitting & Reduce Dependence on Large Training Samples. (arXiv:2201.06938v1 [cs.LG])
    (0 min) In order to develop complex relationships between their inputs and outputs, deep neural networks train and adjust a large number of parameters. To make these networks work at high accuracy, vast amounts of data are needed. Sometimes, however, the quantity of data needed is not present or obtainable for training. Neuron-specific dropout (NSDropout) is a tool to address this problem. NSDropout looks at both the training pass and the validation pass of a layer in a model. By comparing the average values produced by each neuron for each class in a dataset, the network is able to drop targeted units. The layer is able to predict what features, or noise, the model is looking at during testing that are not present when looking at samples from validation. Unlike dropout, the "thinned" networks cannot be "unthinned" for testing. Neuron-specific dropout has proved to achieve similar, if not better, testing accuracy with far less data than traditional methods, including dropout and other regularization methods. Experiments have shown that neuron-specific dropout reduces the chance of a network overfitting and reduces the need for large training samples on supervised learning tasks in image recognition, all while producing best-in-class results.
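    A hedged sketch of the core NSDropout comparison: measure each unit's mean activation on training versus validation batches and deterministically zero the units with the largest gap. The top-k rule and the plain mean are illustrative assumptions:

        import torch

        def nsdropout_mask(train_acts, val_acts, k=10):
            # train_acts, val_acts: (N, units) activations of one layer for one class.
            gap = (train_acts.mean(0) - val_acts.mean(0)).abs()
            drop = torch.topk(gap, k).indices  # units most specific to training data
            mask = torch.ones(train_acts.shape[1])
            mask[drop] = 0.0
            return mask                        # multiply the layer's outputs by this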
    Equitable Community Resilience: The Case of Winter Storm Uri in Texas. (arXiv:2201.06652v1 [stat.ML])
    (0 min) Community resilience in the face of natural hazards relies on a community's potential to bounce back. A failure to integrate equity into resilience considerations results in unequal recovery and disproportionate impacts on vulnerable populations, which has long been a concern in the United States. This research investigated aspects of equity related to community resilience in the aftermath of Winter Storm Uri in Texas, which led to extended power outages for more than 4 million households. County-level outage and recovery data were analyzed to explore potential significant links between various county attributes and their share of the outages during the recovery and restoration phases. Next, satellite imagery was used to examine data at a much higher geographical resolution, focusing on census tracts in the city of Houston. The goal was to use computer vision to extract the extent of outages within census tracts and investigate their linkages to census tract attributes. Results from various statistical procedures revealed statistically significant negative associations between counties' percentage of non-Hispanic whites, as well as median household income, and their ratio of outages. Additionally, at the census tract level, variables including the percentages of linguistically isolated population and public transport users exhibited positive associations with the group of census tracts that were affected by the outage as detected by the computer vision analysis. Informed by these results, engineering solutions such as grid modernization technologies, together with distributed and renewable energy resources, controlled for the region's topographical characteristics, are proposed to enhance equitable power grid resiliency in the face of natural hazards.
    Incompleteness of graph convolutional neural networks for points clouds in three dimensions. (arXiv:2201.07136v1 [stat.ML])
    (0 min) Graph convolutional neural networks (GCNN) are very popular methods in machine learning and have been applied very successfully to the prediction of the properties of molecules and materials. First-order GCNNs are well known to be incomplete, i.e., there exist graphs that are distinct but appear identical when seen through the lens of the GCNN. More complicated schemes have thus been designed to increase their resolving power. Applications to molecules (and, more generally, point clouds), however, add a geometric dimension to the problem. The most straightforward and prevalent approach to constructing a graph representation of a molecule regards atoms as vertices and draws a bond between each pair of atoms within a certain preselected cutoff. Bonds can be decorated with the distance between atoms, and the resulting "distance graph convolution NNs" (dGCNN) have empirically demonstrated excellent resolving power and are widely used in chemical ML. Here we show that even for the restricted case of graphs induced by 3D atom clouds, dGCNNs are not complete. We construct pairs of distinct point clouds that generate graphs that, for any cutoff radius, are equivalent based on a first-order Weisfeiler-Lehman test. This class of degenerate structures includes chemically plausible configurations, setting an ultimate limit on the expressive power of some of the well-established GCNN architectures for atomistic machine learning. Models that explicitly use angular information in the description of atomic environments can resolve these degeneracies.
    Universal Online Learning: an Optimistically Universal Learning Rule. (arXiv:2201.05947v1 [cs.LG])
    (0 min) We study the subject of universal online learning with non-i.i.d. processes for bounded losses. The notion of universally consistent learning was defined by Hanneke in an effort to study learning theory under minimal assumptions, where the objective is to obtain low long-run average loss for any target function. We are interested in characterizing processes for which learning is possible, and whether there exist learning rules guaranteed to be universally consistent given only the assumption that such learning is possible. The case of unbounded losses is very restrictive, since the learnable processes almost surely visit a finite number of points, and as a result, simple memorization is optimistically universal. We focus on the bounded setting and give a complete characterization of the processes admitting strong and weak universal learning. We further show that the k-nearest neighbor algorithm (kNN) is not optimistically universal, and present a novel variant of 1NN which is optimistically universal for general input and value spaces in both the strong and weak settings. This closes all COLT 2021 open problems posed by Hanneke on universal online learning.
    Tk-merge: Computationally Efficient Robust Clustering Under General Assumptions. (arXiv:2201.06391v1 [stat.ME])
    (0 min) We address general-shaped clustering problems under very weak parametric assumptions with a two-step hybrid robust clustering algorithm based on trimmed k-means and hierarchical agglomeration. The algorithm has low computational complexity and effectively identifies the clusters even in the presence of data contamination. We also present natural generalizations of the approach, as well as an adaptive procedure to estimate the amount of contamination in a data-driven fashion. Our proposal outperforms state-of-the-art robust, model-based methods in our numerical simulations and in real-world applications related to color quantization for image analysis, human mobility patterns based on GPS data, biomedical images of diabetic retinopathy, and functional data across weather stations.
    Bandit Data-Driven Optimization. (arXiv:2008.11707v2 [cs.LG] UPDATED)
    (0 min) Applications of machine learning in the non-profit and public sectors often feature an iterative workflow of data acquisition, prediction, and optimization of interventions. There are four major pain points that a machine learning pipeline must overcome in order to be actually useful in these settings: small data, data collected only under the default intervention, unmodeled objectives due to communication gaps, and unforeseen consequences of the intervention. In this paper, we introduce bandit data-driven optimization, the first iterative prediction-prescription framework to address these pain points. Bandit data-driven optimization combines the advantages of online bandit learning and offline predictive analytics in an integrated framework. We propose PROOF, a novel algorithm for this framework, and formally prove that it is no-regret. Using numerical simulations, we show that PROOF achieves superior performance to existing baselines. We also apply PROOF in a detailed case study of food rescue volunteer recommendation, and show that PROOF as a framework works well with the intricacies of ML models in real-world AI applications for the non-profit and public sectors.
    Stochastic Multi-Armed Bandits with Control Variates. (arXiv:2105.03962v3 [cs.LG] UPDATED)
    (0 min) This paper studies a new variant of the stochastic multi-armed bandits problem where auxiliary information about the arm rewards is available in the form of control variates. In many applications like queuing and wireless networks, the arm rewards are functions of some exogenous variables. The mean values of these variables are known a priori from historical data and can be used as control variates. Leveraging the theory of control variates, we obtain mean estimates with smaller variance and tighter confidence bounds. We develop an upper confidence bound based algorithm named UCB-CV and characterize the regret bounds in terms of the correlation between rewards and control variates when they follow a multivariate normal distribution. We also extend UCB-CV to other distributions using resampling methods like Jackknifing and Splitting. Experiments on synthetic problem instances validate performance guarantees of the proposed algorithms.
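    The control-variate correction at the heart of UCB-CV can be illustrated in a few lines; the coefficient below is the classical control-variate estimator, with synthetic data standing in for arm rewards:

        import numpy as np

        def cv_mean(rewards, controls, control_mean):
            r = np.asarray(rewards, dtype=float)
            w = np.asarray(controls, dtype=float)
            beta = ((r - r.mean()) * (w - w.mean())).mean() / w.var()
            # Correct the naive mean using the known mean of the control variate.
            return r.mean() - beta * (w.mean() - control_mean)

        rng = np.random.default_rng(0)
        w = rng.normal(2.0, 1.0, 500)            # exogenous variable with known mean 2.0
        r = 3.0 * w + rng.normal(0.0, 1.0, 500)  # reward correlated with the control
        print(np.mean(r), cv_mean(r, w, 2.0))    # the corrected estimate has lower variance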
    Deep Learning strategies for ProtoDUNE raw data denoising. (arXiv:2103.01596v2 [hep-ph] UPDATED)
    (0 min) In this work, we investigate different machine learning-based strategies for denoising raw simulation data from the ProtoDUNE experiment. The ProtoDUNE detector is hosted by CERN and it aims to test and calibrate the technologies for DUNE, a forthcoming experiment in neutrino physics. The reconstruction workchain consists of converting digital detector signals into physical high-level quantities. We address the first step in reconstruction, namely raw data denoising, leveraging deep learning algorithms. We design two architectures based on graph neural networks, aiming to enhance the receptive field of basic convolutional neural networks. We benchmark this approach against traditional algorithms implemented by the DUNE collaboration. We test the capabilities of graph neural network hardware accelerator setups to speed up training and inference processes.
    Imputing Missing Observations with Time Sliced Synthetic Minority Oversampling Technique. (arXiv:2201.05634v1 [cs.LG])
    (0 min) We present a simple yet novel time series imputation technique with the goal of constructing an irregular time series that is uniform across every sample in a data set. Specifically, we fix a grid defined by the midpoints of non-overlapping bins (dubbed "slices") of observation times and ensure that each sample has values for all of the features at that given time. This allows one both to impute fully missing observations, enabling uniform time series classification across the entire data set, and, in special cases, to impute individually missing features. To do so, we slightly generalize the well-known class imbalance algorithm SMOTE \cite{smote} to allow component-wise nearest neighbor interpolation that preserves correlations when there are no missing features. We visualize the method in the simplified setting of 2-dimensional uncoupled harmonic oscillators. Next, we use tSMOTE to train an Encoder/Decoder long short-term memory (LSTM) model with logistic regression for predicting and classifying distinct trajectories of different 2D oscillators. After illustrating the utility of tSMOTE in this context, we use the same architecture to train a clinical model for COVID-19 disease severity on an imputed data set. Our experiments show an improvement over standard mean and median imputation techniques by allowing a wider class of patient trajectories to be recognized by the model, as well as improvement over aggregated classification models.
    Perfect density models cannot guarantee anomaly detection. (arXiv:2012.03808v3 [cs.LG] UPDATED)
    (0 min) Thanks to the tractability of their likelihood, several deep generative models show promise for seemingly straightforward but important applications like anomaly detection, uncertainty estimation, and active learning. However, the likelihood values empirically attributed to anomalies conflict with the expectations these proposed applications suggest. In this paper, we take a closer look at the behavior of distribution densities through the lens of reparametrization and show that these quantities carry less meaningful information than previously thought, beyond estimation issues or the curse of dimensionality. We conclude that the use of these likelihoods for anomaly detection relies on strong and implicit hypotheses, and highlight the necessity of explicitly formulating these assumptions for reliable anomaly detection.
    Deep Indexed Active Learning for Matching Heterogeneous Entity Representations. (arXiv:2104.03986v2 [cs.DB] UPDATED)
    (0 min) Given two large lists of records, the task in entity resolution (ER) is to find the pairs from the Cartesian product of the lists that correspond to the same real world entity. Typically, passive learning methods on such tasks require large amounts of labeled data to yield useful models. Active Learning is a promising approach for ER in low resource settings. However, the search space, to find informative samples for the user to label, grows quadratically for instance-pair tasks making active learning hard to scale. Previous works, in this setting, rely on hand-crafted predicates, pre-trained language model embeddings, or rule learning to prune away unlikely pairs from the Cartesian product. This blocking step can miss out on important regions in the product space leading to low recall. We propose DIAL, a scalable active learning approach that jointly learns embeddings to maximize recall for blocking and accuracy for matching blocked pairs. DIAL uses an Index-By-Committee framework, where each committee member learns representations based on powerful pre-trained transformer language models. We highlight surprising differences between the matcher and the blocker in the creation of the training data and the objective used to train their parameters. Experiments on five benchmark datasets and a multilingual record matching dataset show the effectiveness of our approach in terms of precision, recall and running time. Code is available at https://github.com/ArjitJ/DIAL
    Iterative Causal Discovery in the Possible Presence of Latent Confounders and Selection Bias. (arXiv:2111.04095v2 [cs.LG] UPDATED)
    (0 min) We present a sound and complete algorithm, called iterative causal discovery (ICD), for recovering causal graphs in the presence of latent confounders and selection bias. ICD relies on the causal Markov and faithfulness assumptions and recovers the equivalence class of the underlying causal graph. It starts with a complete graph, and consists of a single iterative stage that gradually refines this graph by identifying conditional independence (CI) between connected nodes. Independence and causal relations entailed after any iteration are correct, rendering ICD anytime. Essentially, we tie the size of the CI conditioning set to its distance on the graph from the tested nodes, and increase this value in the successive iteration. Thus, each iteration refines a graph that was recovered by previous iterations having smaller conditioning sets -- a higher statistical power -- which contributes to stability. We demonstrate empirically that ICD requires significantly fewer CI tests and learns more accurate causal graphs compared to FCI, FCI+, and RFCI algorithms (code is available at https://github.com/IntelLabs/causality-lab).
    Embodied Self-supervised Learning by Coordinated Sampling and Training. (arXiv:2006.13350v2 [eess.AS] UPDATED)
    (0 min) Self-supervised learning can significantly improve the performance of downstream tasks; however, the dimensions of learned representations normally lack explicit physical meanings. In this work, we propose a novel self-supervised approach to solving inverse problems by employing the corresponding physical forward process, so that the learned representations have explicit physical meanings. The proposed approach works in an analysis-by-synthesis manner to learn an inference network by iteratively sampling and training. At the sampling step, given observed data, the inference network is used to approximate the intractable posterior, from which we sample input parameters and feed them to a physical process to generate data in the observational space; at the training step, the same network is optimized with the sampled paired data. We prove the feasibility of the proposed method by tackling the acoustic-to-articulatory inversion problem, inferring articulatory information from speech. Given an articulatory synthesizer, an inference model can be trained completely from scratch with random initialization. Our experiments demonstrate that the proposed method converges steadily and that the network learns to control the articulatory synthesizer to speak like a human. We also demonstrate that trained models generalize well to unseen speakers or even new languages, and that performance can be further improved through self-adaptation.
    Robust uncertainty estimates with out-of-distribution pseudo-inputs training. (arXiv:2201.05890v1 [cs.LG])
    (0 min) Probabilistic models often use neural networks to control their predictive uncertainty. However, when making out-of-distribution (OOD) predictions, the often-uncontrollable extrapolation properties of neural networks yield poor uncertainty predictions. Such models then don't know what they don't know, which directly limits their robustness with respect to unexpected inputs. To counter this, we propose to explicitly train the uncertainty predictor where we are not given data, to make it reliable. As one cannot train without data, we provide mechanisms for generating pseudo-inputs in informative low-density regions of the input space, and show how to leverage these in a practical Bayesian framework that casts a prior distribution over the model uncertainty. With a holistic evaluation, we demonstrate that this yields robust and interpretable predictions of uncertainty while retaining state-of-the-art performance on diverse tasks such as regression and generative modelling.
    Minimax Optimality (Probably) Doesn't Imply Distribution Learning for GANs. (arXiv:2201.07206v1 [cs.LG])
    (0 min) Arguably the most fundamental question in the theory of generative adversarial networks (GANs) is to understand to what extent GANs can actually learn the underlying distribution. Theoretical and empirical evidence suggests local optimality of the empirical training objective is insufficient. Yet, it does not rule out the possibility that achieving a true population minimax optimal solution might imply distribution learning. In this paper, we show that standard cryptographic assumptions imply that this stronger condition is still insufficient. Namely, we show that if local pseudorandom generators (PRGs) exist, then for a large family of natural continuous target distributions, there are ReLU network generators of constant depth and polynomial size which take Gaussian random seeds so that (i) the output is far in Wasserstein distance from the target distribution, but (ii) no polynomially large Lipschitz discriminator ReLU network can detect this. This implies that even achieving a population minimax optimal solution to the Wasserstein GAN objective is likely insufficient for distribution learning in the usual statistical sense. Our techniques reveal a deep connection between GANs and PRGs, which we believe will lead to further insights into the computational landscape of GANs.
    Fractional SDE-Net: Generation of Time Series Data with Long-term Memory. (arXiv:2201.05974v1 [cs.LG])
    (0 min) In this paper, we focus on the generation of time-series data using neural networks. It is often the case that input time-series data, especially those taken from real financial markets, are irregularly sampled, with a noise structure more complicated than i.i.d. To generate time series with such properties, we propose fSDE-Net: a neural fractional Stochastic Differential Equation Network. It generalizes the neural SDE model by using fractional Brownian motion with Hurst index larger than one half, which exhibits the long-term memory property. We derive the solver of fSDE-Net and theoretically analyze the existence and uniqueness of its solution. Our experiments demonstrate that the fSDE-Net model can replicate distributional properties well.
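    The driving noise of fSDE-Net is fractional Brownian motion with Hurst index H > 1/2; a hedged sketch of exact sampling via the Cholesky factor of the fBm covariance (grid size and H are illustrative choices):

        import numpy as np

        def fbm_sample(n=200, H=0.7, T=1.0, seed=0):
            t = np.linspace(T / n, T, n)
            s, u = np.meshgrid(t, t)
            # fBm covariance: 0.5 * (u^{2H} + s^{2H} - |u - s|^{2H}).
            cov = 0.5 * (u ** (2 * H) + s ** (2 * H) - np.abs(u - s) ** (2 * H))
            L = np.linalg.cholesky(cov + 1e-12 * np.eye(n))  # jitter for stability
            z = np.random.default_rng(seed).normal(size=n)
            return np.concatenate([[0.0], L @ z])            # path started at zero

        path = fbm_sample()
        print(path.shape)  # (201,): one long-memory fBm path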
    On Maximum-a-Posteriori estimation with Plug & Play priors and stochastic gradient descent. (arXiv:2201.06133v1 [stat.ML])
    (0 min) Bayesian methods to solve imaging inverse problems usually combine an explicit data likelihood function with a prior distribution that explicitly models expected properties of the solution. Many kinds of priors have been explored in the literature, from simple ones expressing local properties to more involved ones exploiting image redundancy at a non-local scale. In a departure from explicit modelling, several recent works have proposed and studied the use of implicit priors defined by an image denoising algorithm. This approach, commonly known as Plug & Play (PnP) regularisation, can deliver remarkably accurate results, particularly when combined with state-of-the-art denoisers based on convolutional neural networks. However, the theoretical analysis of PnP Bayesian models and algorithms is difficult, and works on the topic often rely on unrealistic assumptions on the properties of the image denoiser. This paper studies maximum-a-posteriori (MAP) estimation for Bayesian models with PnP priors. We first consider questions related to existence, stability and well-posedness, and then present a convergence proof for MAP computation by PnP stochastic gradient descent (PnP-SGD) under realistic assumptions on the denoiser used. We report a range of imaging experiments demonstrating PnP-SGD as well as comparisons with other PnP schemes.
    Reliable Causal Discovery with Improved Exact Search and Weaker Assumptions. (arXiv:2201.05666v1 [cs.LG])
    (0 min) Many causal discovery methods rely on the faithfulness assumption to guarantee asymptotic correctness. However, the assumption can be approximately violated in many ways, leading to sub-optimal solutions. Although there is a line of research in Bayesian network structure learning that focuses on weakening the assumption, such as exact search methods with well-defined score functions, these methods do not scale well to large graphs. In this work, we introduce several strategies to improve the scalability of exact score-based methods in the linear Gaussian setting. In particular, we develop a super-structure estimation method based on the support of the inverse covariance matrix, which requires assumptions that are strictly weaker than faithfulness, and apply it to restrict the search space of exact search. We also propose a local search strategy that performs exact search on the local clusters formed by each variable and its neighbors within two hops in the super-structure. Numerical experiments validate the efficacy of the proposed procedure, demonstrating that it scales up to hundreds of nodes with high accuracy.
    ResNEsts and DenseNEsts: Block-based DNN Models with Improved Representation Guarantees. (arXiv:2111.05496v2 [cs.LG] UPDATED)
    (0 min) The models recently used in the literature to prove that residual networks (ResNets) are better than linear predictors actually differ from the standard ResNets that have been widely used in computer vision. In addition to assumptions such as scalar-valued output or a single residual block, these models have no nonlinearities at the final residual representation that feeds into the final affine layer. To codify this difference in nonlinearities and reveal a linear estimation property, we define ResNEsts, i.e., Residual Nonlinear Estimators, by simply dropping the nonlinearities at the last residual representation of standard ResNets. We show that wide ResNEsts with bottleneck blocks can always guarantee a very desirable training property that standard ResNets aim to achieve, i.e., adding more blocks does not decrease performance given the same set of basis elements. To prove this, we first recognize that ResNEsts are basis function models that are limited by a coupling problem between basis learning and linear prediction. Then, to decouple prediction weights from basis learning, we construct a special architecture termed the augmented ResNEst (A-ResNEst) that always guarantees no worse performance with the addition of a block. As a result, such an A-ResNEst establishes empirical risk lower bounds for a ResNEst using corresponding bases. Our results demonstrate that ResNEsts indeed have a problem of diminishing feature reuse; however, it can be avoided by sufficiently expanding or widening the input space, leading to the above-mentioned desirable property. Inspired by DenseNets, which have been shown to outperform ResNets, we also propose a corresponding new model called the Densely connected Nonlinear Estimator (DenseNEst). We show that any DenseNEst can be represented as a wide ResNEst with bottleneck blocks. Unlike ResNEsts, DenseNEsts exhibit the desirable property without any special architectural re-design.
    On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay. (arXiv:2106.15739v3 [cs.LG] UPDATED)
    (0 min) Training neural networks with batch normalization and weight decay has become a common practice in recent years. In this work, we show that their combined use may result in a surprising periodic behavior of optimization dynamics: the training process regularly exhibits destabilizations that, however, do not lead to complete divergence but cause a new period of training. We rigorously investigate the mechanism underlying the discovered periodic behavior from both empirical and theoretical points of view and analyze the conditions in which it occurs in practice. We also demonstrate that periodic behavior can be regarded as a generalization of two previously opposing perspectives on training with batch normalization and weight decay, namely the equilibrium presumption and the instability presumption.
    Inferential Theory for Granular Instrumental Variables in High Dimensions. (arXiv:2201.06605v1 [econ.EM])
    (0 min) The Granular Instrumental Variables (GIV) methodology exploits panels with factor error structures to construct instruments to estimate structural time series models with endogeneity even after controlling for latent factors. We extend the GIV methodology in several dimensions. First, we extend the identification procedure to a large $N$ and large $T$ framework, which depends on the asymptotic Herfindahl index of the size distribution of $N$ cross-sectional units. Second, we treat both the factors and loadings as unknown and show that the sampling error in the estimated instrument and factors is negligible when considering the limiting distribution of the structural parameters. Third, we show that the sampling error in the high-dimensional precision matrix is negligible in our estimation algorithm. Fourth, we overidentify the structural parameters with additional constructed instruments, which leads to efficiency gains. Monte Carlo evidence is presented to support our asymptotic theory, and an application to the global crude oil market yields new results.
    SGLB: Stochastic Gradient Langevin Boosting. (arXiv:2001.07248v5 [cs.LG] UPDATED)
    (0 min) This paper introduces Stochastic Gradient Langevin Boosting (SGLB) - a powerful and efficient machine learning framework that can handle a wide range of loss functions and has provable generalization guarantees. The method is based on a special form of the Langevin diffusion equation specifically designed for gradient boosting. This allows us to theoretically guarantee global convergence even for multimodal loss functions, while standard gradient boosting algorithms can guarantee only a local optimum. We also empirically show that SGLB outperforms classic gradient boosting when applied to classification tasks with the 0-1 loss function, which is known to be multimodal.
    Chaining Value Functions for Off-Policy Learning. (arXiv:2201.06468v1 [cs.LG])
    (0 min) To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn `off-policy' about policies that differ from the policy used to generate its experience. This is important for learning counterfactuals, or when the experience was generated outside the agent's control. However, off-policy learning is non-trivial, and standard reinforcement-learning algorithms can be unstable and divergent. In this paper we discuss a novel family of off-policy prediction algorithms which are convergent by construction. The idea is to first learn on-policy about the data-generating behaviour, and then bootstrap an off-policy value estimate on this on-policy estimate, thereby constructing a value estimate that is partially off-policy. This process can be repeated to build a chain of value functions, each time bootstrapping a new estimate on the previous estimate in the chain. Each step in the chain is stable and hence the complete algorithm is guaranteed to be stable. Under mild conditions this comes arbitrarily close to the off-policy TD solution when we increase the length of the chain. Hence it can compute the solution even in cases where off-policy TD diverges. We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix. Furthermore it can be interpreted as estimating a novel objective -- that we call a `k-step expedition' -- of following the target policy for finitely many steps before continuing indefinitely with the behaviour policy. Empirically we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results.
    System-Agnostic Meta-Learning for MDP-based Dynamic Scheduling via Descriptive Policy. (arXiv:2201.07051v1 [cs.LG])
    (0 min) Dynamic scheduling is an important problem in applications from queuing to wireless networks. It addresses how to choose an item among multiple scheduling items in each timestep to achieve a long-term goal. Conventional approaches for dynamic scheduling find the optimal policy for a given specific system, so the resulting policy is usable only for the corresponding system characteristics. Hence, it is hard to use such approaches for a practical system in which system characteristics dynamically change. This paper proposes a novel policy structure for MDP-based dynamic scheduling, a descriptive policy, which has a system-agnostic capability to adapt to unseen system characteristics for an identical task (dynamic scheduling). To this end, the descriptive policy learns a system-agnostic scheduling principle--in a nutshell, "which condition of items should have a higher priority in scheduling". The scheduling principle can be applied to any system, so the descriptive policy learned in one system can be used for another system. Experiments with simple explanatory and realistic application scenarios demonstrate that it enables system-agnostic meta-learning with very little performance degradation compared with system-specific conventional policies.
    Reconstruction of Incomplete Wildfire Data using Deep Generative Models. (arXiv:2201.06153v1 [stat.ML])
    (0 min) We present our submission to the Extreme Value Analysis 2021 Data Challenge in which teams were asked to accurately predict distributions of wildfire frequency and size within spatio-temporal regions of missing data. For the purpose of this competition we developed a variant of the powerful variational autoencoder models dubbed the Conditional Missing data Importance-Weighted Autoencoder (CMIWAE). Our deep latent variable generative model requires little to no feature engineering and does not necessarily rely on the specifics of scoring in the Data Challenge. It is fully trained on incomplete data, with the single objective to maximize log-likelihood of the observed wildfire information. We mitigate the effects of the relatively low number of training samples by stochastic sampling from a variational latent variable distribution, as well as by ensembling a set of CMIWAE models trained and validated on different splits of the provided data. The presented approach is not domain-specific and is amenable to application in other missing data recovery tasks with tabular or image-like information conditioned on auxiliary information.
    PyHHMM: A Python Library for Heterogeneous Hidden Markov Models. (arXiv:2201.06968v1 [cs.MS])
    (0 min) We introduce PyHHMM, an object-oriented open-source Python implementation of Heterogeneous-Hidden Markov Models (HHMMs). In addition to the basic core functionality of HMMs, such as different initialization algorithms and classical observation models, i.e., continuous and multinoulli, PyHHMM distinctively emphasizes features not supported in similar available frameworks: a heterogeneous observation model, missing data inference, different model order selection criteria, and semi-supervised training. These characteristics result in a feature-rich implementation for researchers working with sequential data. PyHHMM relies on the numpy, scipy, scikit-learn, and seaborn Python packages, and is distributed under the Apache-2.0 License. PyHHMM's source code is publicly available on Github (https://github.com/fmorenopino/HeterogeneousHMM) to facilitate adoption and future contributions. Detailed documentation (https://pyhhmm.readthedocs.io/en/latest), covering usage examples and the models' theoretical background, is available. The package can be installed through the Python Package Index (PyPI), via 'pip install pyhhmm'.
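    Since the library's own API is not shown here, the following is a minimal numpy sketch of the core HMM computation that such packages build on -- the scaled forward algorithm for the log-likelihood of a discrete observation sequence. It illustrates the mechanics only; consult the linked documentation for PyHHMM's actual interface.

        import numpy as np

        def forward_loglik(pi, A, B, obs):
            # pi: initial state distribution, A: transition matrix,
            # B: emission matrix (states x symbols), obs: symbol indices.
            alpha = pi * B[:, obs[0]]
            logp = np.log(alpha.sum())
            alpha = alpha / alpha.sum()
            for o in obs[1:]:
                alpha = (alpha @ A) * B[:, o]  # propagate and re-weight
                c = alpha.sum()
                logp += np.log(c)              # accumulate the scaling factors
                alpha = alpha / c              # rescale to avoid underflow
            return logp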
    Training Fair Deep Neural Networks by Balancing Influence. (arXiv:2201.05759v1 [cs.LG])
    (0 min) Most fair machine learning methods either rely heavily on the sensitive information of the training samples or require large modifications to the target models, which hinders their practical application. To address this issue, we propose a two-stage training algorithm named FAIRIF. It minimizes the loss over the reweighted data set (second stage) where the sample weights are computed to balance the model performance across different demographic groups (first stage). FAIRIF can be applied to a wide range of models trained by stochastic gradient descent without changing the model, while only requiring group annotations on a small validation set to compute sample weights. Theoretically, we show that, in the classification setting, three notions of disparity among different groups can be mitigated by training with the weights. Experiments on synthetic data sets demonstrate that FAIRIF yields models with better fairness-utility trade-offs against various types of bias; and on real-world data sets, we show the effectiveness and scalability of FAIRIF. Moreover, as evidenced by the experiments with pretrained models, FAIRIF is able to alleviate the unfairness issue of pretrained models without hurting their performance.
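    A toy Python sketch of the balancing intuition behind the two-stage recipe: compute per-sample weights that equalise each demographic group's total contribution to the loss, then train on the weighted loss as usual. FAIRIF actually derives its weights from influence functions on a validation set; the simple group-frequency weights below are an illustrative assumption.

        import numpy as np

        def group_balance_weights(group_ids):
            # First stage (toy version): weight each sample inversely to its
            # group's size so every group carries equal total loss mass.
            group_ids = np.asarray(group_ids)
            w = np.empty(len(group_ids), dtype=float)
            for g in np.unique(group_ids):
                mask = group_ids == g
                w[mask] = 1.0 / mask.sum()
            return w * len(w) / w.sum()  # renormalise so weights average to 1

        # Second stage: minimise (group_balance_weights(g) * per_sample_loss).mean().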
    Who Increases Emergency Department Use? New Insights from the Oregon Health Insurance Experiment. (arXiv:2201.07072v1 [econ.EM])
    (0 min) We provide new insights into the finding that Medicaid increased emergency department (ED) use from the Oregon experiment. Using nonparametric causal machine learning methods, we find economically meaningful treatment effect heterogeneity in the impact of Medicaid coverage on ED use. The effect distribution is widely dispersed, with significant positive effects concentrated among high-use individuals. A small group - about 14% of participants - in the right tail with significant increases in ED use drives the overall effect. The remainder of the individualized treatment effects is either indistinguishable from zero or negative. The average treatment effect is not representative of the individualized treatment effect for most people. We identify four priority groups with large and statistically significant increases in ED use - men, prior SNAP participants, adults less than 50 years old, and those with pre-lottery ED use classified as primary care treatable. Our results point to an essential role of intensive margin effects - Medicaid increases utilization among those already accustomed to ED use and who use the emergency department for all types of care. We leverage the heterogeneous effects to estimate optimal assignment rules to prioritize insurance applications in similar expansions.
    Risk-Monotonicity in Statistical Learning. (arXiv:2011.14126v5 [cs.LG] UPDATED)
    (0 min) Acquisition of data is a difficult task in many applications of machine learning, and it is only natural that one hopes and expects the population risk to decrease (better performance) monotonically with increasing data points. It turns out, somewhat surprisingly, that this is not the case even for the most standard algorithms that minimize the empirical risk. Non-monotonic behavior of the risk and instability in training have appeared in the popular deep learning paradigm under the name of double descent. These problems highlight the current lack of understanding of learning algorithms and generalization. It is, therefore, crucial to pursue this concern and provide a characterization of such behavior. In this paper, we derive the first consistent and risk-monotonic (in high probability) algorithms for a general statistical learning setting under weak assumptions, consequently answering some questions posed by Viering et al. 2019 on how to avoid non-monotonic behavior of risk curves. We further show that risk monotonicity need not necessarily come at the price of worse excess risk rates. To achieve this, we derive new empirical Bernstein-like concentration inequalities of independent interest that hold for certain non-i.i.d.~processes such as Martingale Difference Sequences.
    Improving the quality control of seismic data through active learning. (arXiv:2201.06616v1 [stat.ML])
    (0 min) In image denoising problems, the increasing volume of available images makes an exhaustive visual inspection impossible, and therefore automated methods based on machine learning must be deployed for this purpose. This is particularly the case in seismic signal processing. Engineers/geophysicists have to deal with millions of seismic time series. Finding the sub-surface properties useful for the oil industry may take up to a year and is very costly in terms of computing/human resources. In particular, the data must go through different steps of noise attenuation. Each denoising step is then ideally followed by a quality control (QC) stage performed by means of human expertise. To learn a quality control classifier in a supervised manner, labeled training data must be available, but collecting the labels from human experts is extremely time-consuming. We therefore propose a novel active learning methodology to sequentially select the most relevant data, which are then given back to a human expert for labeling. Beyond the application in geophysics, the technique we promote in this paper, based on estimates of the local error and its uncertainty, is generic. Its performance is supported by strong empirical evidence, as illustrated by the numerical experiments presented in this article, where it is compared to alternative active learning strategies both on synthetic and real seismic datasets.
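    A generic Python sketch of the acquisition rule described above: score each unlabeled sample by its estimated local error plus a multiple of the uncertainty of that estimate, and send the top-scoring samples to the human expert. The inputs are hypothetical outputs of any error model (e.g., an ensemble); the paper's exact estimator is not reproduced here.

        import numpy as np

        def select_for_labeling(error_estimate, error_uncertainty, k=10, kappa=1.0):
            # Higher score = more likely misclassified and/or more uncertain.
            score = error_estimate + kappa * error_uncertainty
            return np.argsort(score)[-k:]  # indices to send for expert labeling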
    Posterior Adaptation With New Priors. (arXiv:2007.01386v3 [cs.LG] UPDATED)
    (2 min) Classification approaches based on the direct estimation and analysis of posterior probabilities will degrade if the original class priors begin to change. We prove that the data likelihoods for a test example can be recovered, uniquely up to scale, from its original class posteriors and dataset priors. Given the recovered likelihoods and a set of new priors, the posteriors can be re-computed using Bayes' Rule to reflect the influence of the new priors. The method is simple to compute and allows a dynamic update of the original posteriors.
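    The update is mechanical enough to show directly; a minimal Python sketch under the stated setup (likelihood recovered, up to a constant, as posterior divided by prior, then re-weighted by the new priors and renormalised):

        import numpy as np

        def adapt_posterior(posteriors, old_priors, new_priors):
            likelihoods = posteriors / old_priors     # p(x|c) up to a constant
            unnormalised = likelihoods * new_priors   # Bayes' rule with new priors
            return unnormalised / unnormalised.sum(axis=-1, keepdims=True)

        # Example: a posterior of [0.6, 0.4] under uniform priors shifts towards
        # class 1 when the new priors are [0.2, 0.8]:
        adapt_posterior(np.array([0.6, 0.4]), np.array([0.5, 0.5]),
                        np.array([0.2, 0.8]))  # -> approx. [0.27, 0.73]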
    A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits. (arXiv:2201.06532v1 [cs.LG])
    (2 min) We study the non-stationary stochastic multi-armed bandit problem, where the reward statistics of each arm may change several times during the course of learning. The performance of a learning algorithm is evaluated in terms of their dynamic regret, which is defined as the difference of the expected cumulative reward of an agent choosing the optimal arm in every round and the cumulative reward of the learning algorithm. One way to measure the hardness of such environments is to consider how many times the identity of the optimal arm can change. We propose a method that achieves, in $K$-armed bandit problems, a near-optimal $\widetilde O(\sqrt{K N(S+1)})$ dynamic regret, where $N$ is the number of rounds and $S$ is the number of times the identity of the optimal arm changes, without prior knowledge of $S$ and $N$. Previous works for this problem obtain regret bounds that scale with the number of changes (or the amount of change) in the reward functions, which can be much larger, or assume prior knowledge of $S$ to achieve similar bounds.
    Incorporating Multiple Cluster Centers for Multi-Label Learning. (arXiv:2004.08113v3 [cs.LG] UPDATED)
    (2 min) Multi-label learning deals with the problem that each instance is associated with multiple labels simultaneously. Most of the existing approaches aim to improve the performance of multi-label learning by exploiting label correlations. Although the data augmentation technique is widely used in many machine learning tasks, it is still unclear whether data augmentation is helpful to multi-label learning. In this article, we propose to leverage the data augmentation technique to improve the performance of multi-label learning. Specifically, we first propose a novel data augmentation approach that performs clustering on the real examples and treats the cluster centers as virtual examples, and these virtual examples naturally embody the local label correlations and label importances. Then, motivated by the cluster assumption that examples in the same cluster should have the same label, we propose a novel regularization term to bridge the gap between the real examples and virtual examples, which can promote the local smoothness of the learning function. Extensive experimental results on a number of real-world multi-label datasets clearly demonstrate that our proposed approach outperforms the state-of-the-art counterparts.
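    A short Python sketch of the augmentation idea: cluster the real examples and treat the cluster centers as virtual examples. Labelling each center with the mean label vector of its cluster is an illustrative choice here, not necessarily the paper's exact construction, and the smoothness regulariser is left out.

        import numpy as np
        from sklearn.cluster import KMeans

        def virtual_examples(X, Y, n_clusters=50):
            # X: (n, d) features; Y: (n, L) binary multi-label matrix.
            km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
            soft_labels = np.stack([Y[km.labels_ == c].mean(axis=0)
                                    for c in range(n_clusters)])
            return km.cluster_centers_, soft_labels  # virtual (X, Y) pairs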
    Minimax risk classifiers with 0-1 loss. (arXiv:2201.06487v1 [stat.ML])
    (2 min) Supervised classification techniques use training samples to learn a classification rule with small expected 0-1-loss (error probability). Conventional methods enable tractable learning and provide out-of-sample generalization by using surrogate losses instead of the 0-1-loss and considering specific families of rules (hypothesis classes). This paper presents minimax risk classifiers (MRCs) that minimize the worst-case 0-1-loss over general classification rules and provide tight performance guarantees at learning. We show that MRCs are strongly universally consistent using feature mappings given by characteristic kernels. The paper also proposes efficient optimization techniques for MRC learning and shows that the methods presented can provide accurate classification together with tight performance guarantees.
    On Well-posedness and Minimax Optimal Rates of Nonparametric Q-function Estimation in Off-policy Evaluation. (arXiv:2201.06169v1 [math.ST])
    (2 min) We study the off-policy evaluation (OPE) problem in an infinite-horizon Markov decision process with continuous states and actions. We recast the $Q$-function estimation into a special form of the nonparametric instrumental variables (NPIV) estimation problem. We first show that under one mild condition the NPIV formulation of $Q$-function estimation is well-posed in the sense of $L^2$-measure of ill-posedness with respect to the data generating distribution, bypassing a strong assumption on the discount factor $\gamma$ imposed in the recent literature for obtaining the $L^2$ convergence rates of various $Q$-function estimators. Thanks to this new well-posed property, we derive the first minimax lower bounds for the convergence rates of nonparametric estimation of $Q$-function and its derivatives in both sup-norm and $L^2$-norm, which are shown to be the same as those for the classical nonparametric regression (Stone, 1982). We then propose a sieve two-stage least squares estimator and establish its rate-optimality in both norms under some mild conditions. Our general results on the well-posedness and the minimax lower bounds are of independent interest to study not only other nonparametric estimators for $Q$-function but also efficient estimation on the value of any target policy in off-policy settings.
    Observing how deep neural networks understand physics through the energy spectrum of one-dimensional quantum mechanics. (arXiv:2201.06676v1 [physics.comp-ph])
    (2 min) We investigated how neural networks (NNs) understand physics using one-dimensional quantum mechanics. After training an NN to accurately predict energy eigenvalues from potentials, we used it to confirm the NN's understanding of physics from four different aspects. The trained NN could predict energy eigenvalues of a different potential than the one learned, focus on minima and maxima of a potential, predict the probability distribution of the existence of particles not used during training, and reproduce untrained physical phenomena. These results show that NNs can learn the laws of physics from only a limited set of data, predict the results of experiments under conditions different from those used for training, and predict physical quantities of types not provided during training. Since NNs understand physics through a different path than the one humans take, they complement the human way of understanding and will be a powerful tool for advancing physics.
    Differentiable Rule Induction with Learned Relational Features. (arXiv:2201.06515v1 [stat.ML])
    (2 min) Rule-based decision models are attractive due to their interpretability. However, existing rule induction methods often result in long and consequently less interpretable sets of rules. This problem can, in many cases, be attributed to the rule learner's lack of an appropriately expressive vocabulary, i.e., relevant predicates. Most existing rule induction algorithms presume the availability of predicates used to represent the rules, naturally decoupling the predicate definition and the rule learning phases. In contrast, we propose the Relational Rule Network (RRN), a neural architecture that learns relational predicates that represent a linear relationship among attributes along with the rules that use them. This approach opens the door to increasing the expressiveness of induced decision models by coupling predicate learning directly with rule learning in an end-to-end differentiable fashion. On benchmark tasks, we show that these relational predicates are simple enough to retain interpretability, yet improve prediction accuracy and provide sets of rules that are more concise compared to state-of-the-art rule induction algorithms.
    Online Time Series Anomaly Detection with State Space Gaussian Processes. (arXiv:2201.06763v1 [cs.LG])
    (2 min) We propose r-ssGPFA, an unsupervised online anomaly detection model for uni- and multivariate time series building on the efficient state space formulation of Gaussian processes. For high-dimensional time series, we propose an extension of Gaussian process factor analysis to identify the common latent processes of the time series, allowing us to detect anomalies efficiently in an interpretable manner. We gain explainability while speeding up computations by imposing an orthogonality constraint on the mapping from the latent to the observed space. Our model's robustness is improved by using a simple heuristic to skip Kalman updates when encountering anomalous observations. We investigate the behaviour of our model on synthetic data and show on standard benchmark datasets that our method is competitive with state-of-the-art methods while being computationally cheaper.
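    The skipping heuristic is easy to illustrate in one dimension. A minimal Python sketch, assuming a random-walk state with process noise q and observation noise r (not the full ssGPFA model): flag an observation as anomalous, and skip the Kalman update, when the innovation exceeds a few predictive standard deviations.

        import numpy as np

        def robust_kalman_1d(ys, q=1e-3, r=1e-1, thresh=3.0):
            m, p, flags = 0.0, 1.0, []
            for y in ys:
                p += q                                   # predict step
                innov, s = y - m, p + r                  # innovation and its variance
                if abs(innov) > thresh * np.sqrt(s):
                    flags.append(True)                   # anomaly: skip the update
                else:
                    k = p / s                            # Kalman gain
                    m, p = m + k * innov, (1 - k) * p    # update step
                    flags.append(False)
            return np.array(flags)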
    Towards Sample-efficient Overparameterized Meta-learning. (arXiv:2201.06142v1 [cs.LG])
    (2 min) An overarching goal in machine learning is to build a generalizable model with few samples. To this end, overparameterization has been the subject of immense interest to explain the generalization ability of deep nets even when the size of the dataset is smaller than that of the model. While the prior literature focuses on the classical supervised setting, this paper aims to demystify overparameterization for meta-learning. Here we have a sequence of linear-regression tasks and we ask: (1) Given earlier tasks, what is the optimal linear representation of features for a new downstream task? and (2) How many samples do we need to build this representation? This work shows that surprisingly, overparameterization arises as a natural answer to these fundamental meta-learning questions. Specifically, for (1), we first show that learning the optimal representation coincides with the problem of designing a task-aware regularization to promote inductive bias. We leverage this inductive bias to explain how the downstream task actually benefits from overparameterization, in contrast to prior works on few-shot learning. For (2), we develop a theory to explain how feature covariance can implicitly help reduce the sample complexity well below the degrees of freedom and lead to small estimation error. We then integrate these findings to obtain an overall performance guarantee for our meta-learning algorithm. Numerical experiments on real and synthetic data verify our insights on overparameterized meta-learning.
    Tolerance and Prediction Intervals for Non-normal Models. (arXiv:2011.11583v5 [stat.ME] UPDATED)
    (2 min) A prediction interval covers a future observation from a random process in repeated sampling, and is typically constructed by identifying a pivotal quantity that is also an ancillary statistic. Analogously, a tolerance interval covers a population percentile in repeated sampling and is often based on a pivotal quantity. One approach we consider in non-normal models leverages a link function resulting in a pivotal quantity that is approximately normally distributed. In settings where this normal approximation does not hold we consider a second approach for tolerance and prediction based on a confidence interval for the mean. These methods are intuitive, simple to implement, have proper operating characteristics, and are computationally efficient compared to Bayesian, re-sampling, and machine learning methods. This is demonstrated in the context of multi-site clinical trial recruitment with staggered site initiation, real-world time on treatment, and end-of-study success for a clinical endpoint.
    Hierarchical VAEs Know What They Don't Know. (arXiv:2102.08248v7 [cs.LG] UPDATED)
    (2 min) Deep generative models have been demonstrated as state-of-the-art density estimators. Yet, recent work has found that they often assign a higher likelihood to data from outside the training distribution. This seemingly paradoxical behavior has caused concerns over the quality of the attained density estimates. In the context of hierarchical variational autoencoders, we provide evidence to explain this behavior by out-of-distribution data having in-distribution low-level features. We argue that this is both expected and desirable behavior. With this insight in hand, we develop a fast, scalable and fully unsupervised likelihood-ratio score for OOD detection that requires data to be in-distribution across all feature-levels. We benchmark the method on a vast set of data and model combinations and achieve state-of-the-art results on out-of-distribution detection.
    Efficient Hyperparameter Tuning for Large Scale Kernel Ridge Regression. (arXiv:2201.06314v1 [cs.LG])
    (2 min) Kernel methods provide a principled approach to nonparametric learning. While their basic implementations scale poorly to large problems, recent advances showed that approximate solvers can efficiently handle massive datasets. A shortcoming of these solutions is that hyperparameter tuning is not taken care of, and left for the user to perform. Hyperparameters are crucial in practice and the lack of automated tuning greatly hinders efficiency and usability. In this paper, we work to fill in this gap focusing on kernel ridge regression based on the Nystr\"om approximation. After reviewing and contrasting a number of hyperparameter tuning strategies, we propose a complexity regularization criterion based on a data dependent penalty, and discuss its efficient optimization. Then, we proceed to a careful and extensive empirical evaluation highlighting strengths and weaknesses of the different tuning strategies. Our analysis shows the benefit of the proposed approach, that we hence incorporate in a library for large scale kernel methods to derive adaptively tuned solutions.
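    A plain numpy sketch of the object being tuned: Nystrom kernel ridge regression with an RBF kernel, where the hyperparameters are the regularisation lam and the kernel width gamma. The paper's contribution is a data-dependent complexity criterion for choosing them automatically; the simple hold-out loop suggested in the closing comment is only a stand-in for that criterion.

        import numpy as np

        def nystrom_krr_fit(X, y, centers, lam, gamma):
            def rbf(A, B):
                d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
                return np.exp(-gamma * d)
            Knm, Kmm = rbf(X, centers), rbf(centers, centers)
            # Standard Nystrom KRR system: (Knm^T Knm + lam * n * Kmm) alpha = Knm^T y
            alpha = np.linalg.solve(Knm.T @ Knm + lam * len(X) * Kmm
                                    + 1e-10 * np.eye(len(centers)), Knm.T @ y)
            return lambda Xt: rbf(Xt, centers) @ alpha

        # Tuning: evaluate candidate (lam, gamma) pairs on a validation split and
        # keep the pair with the smallest mean squared error.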
    Contrastive Regularization for Semi-Supervised Learning. (arXiv:2201.06247v1 [cs.LG])
    (2 min) Consistency regularization on label predictions has become a fundamental technique in semi-supervised learning, but it still requires a large number of training iterations for high performance. In this study, we show that consistency regularization restricts the propagation of labeling information due to the exclusion of samples with unconfident pseudo-labels from the model updates. Then, we propose contrastive regularization to improve both the efficiency and accuracy of consistency regularization via well-clustered features of unlabeled data. Specifically, after strongly augmented samples are assigned to clusters by their pseudo-labels, our contrastive regularization updates the model so that the features with confident pseudo-labels aggregate the features in the same cluster, while pushing away features in different clusters. As a result, the information of confident pseudo-labels can be effectively propagated into more unlabeled samples during training by the well-clustered features. On benchmarks of semi-supervised learning tasks, our contrastive regularization improves on previous consistency-based methods and achieves state-of-the-art results, especially with fewer training iterations. Our method also shows robust performance on open-set semi-supervised learning, where the unlabeled data include out-of-distribution samples.
    Faster Rates of Private Stochastic Convex Optimization. (arXiv:2108.00331v3 [cs.LG] UPDATED)
    (2 min) In this paper, we revisit the problem of Differentially Private Stochastic Convex Optimization (DP-SCO) and provide excess population risks for some special classes of functions that are faster than the previous results of general convex and strongly convex functions. In the first part of the paper, we study the case where the population risk function satisfies the Tsybakov Noise Condition (TNC) with some parameter $\theta>1$. Specifically, we first show that under some mild assumptions on the loss functions, there is an algorithm whose output could achieve an upper bound of $\tilde{O}((\frac{1}{\sqrt{n}}+\frac{\sqrt{d\log \frac{1}{\delta}}}{n\epsilon})^\frac{\theta}{\theta-1})$ for $(\epsilon, \delta)$-DP when $\theta\geq 2$, where $n$ is the sample size and $d$ is the dimension of the space. Then we address the inefficiency issue, improve the upper bounds by $\text{Poly}(\log n)$ factors and extend to the case where $\theta\geq \bar{\theta}>1$ for some known $\bar{\theta}$. Next we show that the excess population risk of population functions satisfying TNC with parameter $\theta\geq 2$ is always lower bounded by $\Omega((\frac{d}{n\epsilon})^\frac{\theta}{\theta-1})$ and $\Omega((\frac{\sqrt{d\log \frac{1}{\delta}}}{n\epsilon})^\frac{\theta}{\theta-1})$ for $\epsilon$-DP and $(\epsilon, \delta)$-DP, respectively. In the second part, we focus on a special case where the population risk function is strongly convex. Unlike the previous studies, here we assume the loss function is {\em non-negative} and {\em the optimal value of population risk is sufficiently small}. With these additional assumptions, we propose a new method whose output could achieve an upper bound of $O(\frac{d\log\frac{1}{\delta}}{n^2\epsilon^2}+\frac{1}{n^{\tau}})$ for any $\tau\geq 1$ in the $(\epsilon,\delta)$-DP model if the sample size $n$ is sufficiently large.
    (Almost) All of Entity Resolution. (arXiv:2008.04443v3 [stat.ME] UPDATED)
    (2 min) Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme - integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, a process commonly known as record linkage, de-duplication, or entity resolution. In this article, we review motivational applications and seminal papers that have led to the growth of this area. Specifically, we review the foundational work that began in the 1940s and 1950s and has led to modern probabilistic record linkage. We review clustering approaches to entity resolution, semi- and fully supervised methods, and canonicalization, which are being used throughout industry and academia in applications such as human rights, official statistics, medicine, and citation networks, among others. Finally, we discuss current research topics of practical importance.
    Selecting time-series hyperparameters with the artificial jackknife. (arXiv:2002.04697v4 [stat.ME] UPDATED)
    (2 min) This article proposes a generalisation of the delete-$d$ jackknife to solve hyperparameter selection problems for time series. I call it artificial delete-$d$ jackknife to stress that this approach substitutes the classic removal step with a fictitious deletion, wherein observed datapoints are replaced with artificial missing values. This procedure keeps the data order intact and allows plain compatibility with time series. This manuscript shows a simple illustration in which it is applied to regulate high-dimensional elastic-net vector autoregressive moving average (VARMA) models.
    Robust Aggregation for Federated Learning. (arXiv:1912.13445v2 [stat.ML] UPDATED)
    (2 min) Federated learning is the centralized training of statistical models from decentralized data on mobile devices while preserving the privacy of each device. We present a robust aggregation approach to make federated learning robust to settings when a fraction of the devices may be sending corrupted updates to the server. The approach relies on a robust aggregation oracle based on the geometric median, which returns a robust aggregate using a constant number of iterations of a regular non-robust averaging oracle. The robust aggregation oracle is privacy-preserving, similar to the non-robust secure average oracle it builds upon. We establish its convergence for least squares estimation of additive models. We provide experimental results with linear models and deep networks for three tasks in computer vision and natural language processing. The robust aggregation approach is agnostic to the level of corruption; it outperforms the classical aggregation approach in terms of robustness when the level of corruption is high, while being competitive in the regime of low corruption. Two variants, a faster one with one-step robust aggregation and another one with on-device personalization, round off the paper.
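    The aggregation oracle at the heart of the approach is the geometric median, which can be computed with the classical Weiszfeld fixed-point iteration. A minimal Python sketch (the paper uses a smoothed variant built from privacy-preserving averaging calls; the plain iteration below only conveys the idea):

        import numpy as np

        def geometric_median(updates, iters=50, eps=1e-8):
            # updates: (n_clients, dim) array of model updates.
            z = updates.mean(axis=0)                 # start from the plain average
            for _ in range(iters):
                d = np.linalg.norm(updates - z, axis=1)
                w = 1.0 / np.maximum(d, eps)         # inverse-distance weights
                z = (w[:, None] * updates).sum(0) / w.sum()
            return z  # robust aggregate: outlying updates get down-weighted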
    Bayesian Calibration of imperfect computer models using Physics-informed priors. (arXiv:2201.06463v1 [stat.ML])
    (2 min) In this work we introduce a computationally efficient data-driven framework suitable for quantifying the uncertainty in physical parameters of computer models represented by differential equations. We construct physics-informed priors for differential equations, which are multi-output Gaussian process (GP) priors that encode the model's structure in the covariance function. We extend this into a fully Bayesian framework which allows quantifying the uncertainty of physical parameters and model predictions. Since physical models are usually imperfect descriptions of the real process, we allow the model to deviate from the observed data by considering a discrepancy function. For inference, Hamiltonian Monte Carlo (HMC) sampling is used. This work is motivated by the need for interpretable parameters for the hemodynamics of the heart for personalized treatment of hypertension. The model used is the arterial Windkessel model, which represents the hemodynamics of the heart through differential equations with physically interpretable parameters of medical interest. Like most physical models, the Windkessel model is an imperfect description of the real process. To demonstrate our approach we simulate noisy data from a more complex physical model with known mathematical connections to our modeling choice. We show that without accounting for discrepancy, the posterior of the physical parameters deviates from the true values, while accounting for discrepancy yields a reasonable quantification of the physical parameters' uncertainty and reduces the uncertainty in subsequent model predictions.
    Risk bounds for PU learning under Selected At Random assumption. (arXiv:2201.06277v1 [math.ST])
    (2 min) Positive-unlabeled learning (PU learning) is known as a special case of semi-supervised binary classification where only a fraction of positive examples are labeled. The challenge is then to find the correct classifier despite this lack of information. Recently, new methodologies have been introduced to address the case where the probability of being labeled may depend on the covariates. In this paper, we are interested in establishing risk bounds for PU learning under this general assumption. In addition, we quantify the impact of label noise on PU learning compared to the standard classification setting. Finally, we provide a lower bound on the minimax risk, proving that the upper bound is almost optimal.
    Low Regret Binary Sampling Method for Efficient Global Optimization of Univariate Functions. (arXiv:2201.07164v1 [cs.LG])
    (2 min) In this work, we propose a computationally efficient algorithm for the problem of global optimization of univariate loss functions. For the performance evaluation, we study the cumulative regret of the algorithm instead of the simple regret between our best query and the optimal value of the objective function. Although our approach has regret results similar to those of traditional lower-bounding algorithms such as the Piyavskii-Shubert method for Lipschitz continuous or Lipschitz smooth functions, it has a major computational cost advantage. In the Piyavskii-Shubert method, for certain types of functions, the query points may be hard to determine (as they are solutions to additional optimization problems). However, this issue is circumvented in our binary sampling approach, where the sampling set is predetermined irrespective of the function characteristics. For a search space of $[0,1]$, our approach has at most $L\log (3T)$ and $2.25H$ regret for $L$-Lipschitz continuous and $H$-Lipschitz smooth functions respectively. We also analytically extend our results to a broader class of functions that covers more complex regularity conditions.
    Theoretical analysis and computation of the sample Frechet mean for sets of large graphs based on spectral information. (arXiv:2201.05923v1 [stat.ML])
    (2 min) To characterize the location (mean, median) of a set of graphs, one needs a notion of centrality that is adapted to metric spaces, since graph sets are not Euclidean spaces. A standard approach is to consider the Frechet mean. In this work, we equip a set of graphs with the pseudometric defined by the norm between the eigenvalues of their respective adjacency matrices. Unlike the edit distance, this pseudometric reveals structural changes at multiple scales, and is well adapted to studying various statistical problems for graph-valued data. We describe an algorithm to compute an approximation to the sample Frechet mean of a set of undirected unweighted graphs with a fixed size using this pseudometric.
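    The pseudometric itself is a few lines of numpy; a minimal sketch for two equal-size graphs given as adjacency matrices (the Frechet-mean algorithm built on top of it is not reproduced here):

        import numpy as np

        def spectral_distance(A1, A2):
            # Norm between the sorted adjacency spectra: a pseudometric, since
            # non-isomorphic cospectral graphs sit at distance zero.
            e1 = np.linalg.eigvalsh(A1)   # eigvalsh returns sorted eigenvalues
            e2 = np.linalg.eigvalsh(A2)
            return np.linalg.norm(e1 - e2)

        # A sample Frechet mean candidate minimises the sum of squared
        # spectral_distance values to all graphs in the set.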
    Transport away your problems: Calibrating stochastic simulations with optimal transport. (arXiv:2107.08648v2 [physics.data-an] UPDATED)
    (2 min) Stochastic simulators are an indispensable tool in many branches of science. Often based on first principles, they deliver a series of samples whose distribution implicitly defines a probability measure to describe the phenomena of interest. However, the fidelity of these simulators is not always sufficient for all scientific purposes, necessitating the construction of ad-hoc corrections to "calibrate" the simulation and ensure that its output is a faithful representation of reality. In this paper, we leverage methods from transportation theory to construct such corrections in a systematic way. We use a neural network to compute minimal modifications to the individual samples produced by the simulator such that the resulting distribution becomes properly calibrated. We illustrate the method and its benefits in the context of experimental particle physics, where the need for calibrated stochastic simulators is particularly pronounced.
    Tight Regret Bounds for Noisy Optimization of a Brownian Motion. (arXiv:2001.09327v2 [cs.LG] UPDATED)
    (2 min) We consider the problem of Bayesian optimization of a one-dimensional Brownian motion in which the $T$ adaptively chosen observations are corrupted by Gaussian noise. We show that the smallest possible expected cumulative regret and the smallest possible expected simple regret scale as $\Omega(\sigma\sqrt{T / \log (T)}) \cap \mathcal{O}(\sigma\sqrt{T} \cdot \log T)$ and $\Omega(\sigma / \sqrt{T \log (T)}) \cap \mathcal{O}(\sigma\log T / \sqrt{T})$ respectively, where $\sigma^2$ is the noise variance. Thus, our upper and lower bounds are tight up to a factor of $\mathcal{O}( (\log T)^{1.5} )$. The upper bound uses an algorithm based on confidence bounds and the Markov property of Brownian motion (among other useful properties), and the lower bound is based on a reduction to binary hypothesis testing.
    Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence. (arXiv:2106.03743v5 [cs.LG] UPDATED)
    (2 min) We investigate the reasons for the performance degradation incurred with batch-independent normalization. We find that the prototypical techniques of layer normalization and instance normalization both induce the appearance of failure modes in the neural network's pre-activations: (i) layer normalization induces a collapse towards channel-wise constant functions; (ii) instance normalization induces a lack of variability in instance statistics, symptomatic of an alteration of the expressivity. To alleviate failure mode (i) without aggravating failure mode (ii), we introduce the technique "Proxy Normalization" that normalizes post-activations using a proxy distribution. When combined with layer normalization or group normalization, this batch-independent normalization emulates batch normalization's behavior and consistently matches or exceeds its performance.
    Targeted VAE: Variational and Targeted Learning for Causal Inference. (arXiv:2009.13472v5 [stat.ML] UPDATED)
    (2 min) Undertaking causal inference with observational data is incredibly useful across a wide range of tasks including the development of medical treatments, advertisements and marketing, and policy making. There are two significant challenges associated with undertaking causal inference using observational data: treatment assignment heterogeneity (\textit{i.e.}, differences between the treated and untreated groups), and an absence of counterfactual data (\textit{i.e.}, not knowing what would have happened if an individual who did get treatment were instead to have not been treated). We address these two challenges by combining structured inference and targeted learning. In terms of structure, we factorize the joint distribution into risk, confounding, instrumental, and miscellaneous factors, and in terms of targeted learning, we apply a regularizer derived from the influence curve in order to reduce residual bias. An ablation study is undertaken, and an evaluation on benchmark datasets demonstrates that the targeted VAE (TVAE) has competitive, state-of-the-art performance.
    Quantum Ensemble for Classification. (arXiv:2007.01028v3 [cs.LG] UPDATED)
    (2 min) A powerful way to improve performance in machine learning is to construct an ensemble that combines the predictions of multiple models. Ensemble methods are often much more accurate and have lower variance than the individual classifiers that make them up, but have high requirements in terms of memory and computational time. In fact, a large number of alternative algorithms is usually adopted, each of which must query all available data. We propose a new quantum algorithm that exploits quantum superposition, entanglement and interference to build an ensemble of classification models. Thanks to the generation of several quantum trajectories in superposition, we obtain $B$ transformations of the quantum state which encodes the training set in only $\log(B)$ operations. This implies exponential growth of the ensemble size while the depth of the corresponding circuit grows only linearly. Furthermore, when considering the overall cost of the algorithm, we show that the training of a single weak classifier impacts the overall time complexity additively rather than multiplicatively, as usually happens in classical ensemble methods. We also present small-scale experiments on real-world datasets, defining a quantum version of the cosine classifier and using the IBM qiskit environment to show how the algorithms work.
    Matrix Reordering for Noisy Disordered Matrices: Optimality and Computationally Efficient Algorithms. (arXiv:2201.06438v1 [math.ST])
    (2 min) Motivated by applications in single-cell biology and metagenomics, we consider matrix reordering based on the noisy disordered matrix model. We first establish the fundamental statistical limit for the matrix reordering problem in a decision-theoretic framework and show that a constrained least square estimator is rate-optimal. Given the computational hardness of the optimal procedure, we analyze a popular polynomial-time algorithm, spectral seriation, and show that it is suboptimal. We then propose a novel polynomial-time adaptive sorting algorithm with guaranteed improvement on the performance. The superiority of the adaptive sorting algorithm over the existing methods is demonstrated in simulation studies and in the analysis of two real single-cell RNA sequencing datasets.
    A three layer neural network can represent any multivariate function. (arXiv:2012.03016v2 [cs.LG] UPDATED)
    (2 min) In 1987, Hecht-Nielsen showed that any continuous multivariate function can be implemented by a certain type of three-layer neural network. This result was much discussed in the neural network literature. In this paper we prove that not only continuous functions but also all discontinuous functions can be implemented by such neural networks.
    Mixed Logit Models and Network Formation. (arXiv:2006.16516v3 [cs.SI] UPDATED)
    (2 min) The study of network formation is pervasive in economics, sociology, and many other fields. In this paper, we model network formation as a `choice' that is made by nodes in a network to connect to other nodes. We study these `choices' using discrete-choice models, in which an agent chooses between two or more discrete alternatives. We employ the `repeated-choice' (RC) model and argue that it overcomes important limitations of the multinomial logit (MNL) model, another framework for studying network formation, and that it is well-suited to this task. We also illustrate how to use the RC model to accurately study network formation using both synthetic and real-world networks. Using synthetic networks, we compare the performance of the MNL model and the RC model, and find that the RC model estimates the data-generation process of our synthetic networks more accurately. We do a case study of a qualitatively interesting scenario -- the fact that new patents are more likely to cite older, more cited, and similar patents -- for which the RC model allows us to achieve insights.
    Convolutional Signature for Sequential Data. (arXiv:2009.06719v2 [stat.ML] UPDATED)
    (2 min) Signature is an infinite graded sequence of statistics known to characterize geometric rough paths, which include paths of bounded variation. This object has been studied successfully for machine learning, mostly in low-dimensional applications. In the high-dimensional case, it suffers from exponential growth in the number of features in the truncated signature transform. We propose a novel neural network based model which borrows ideas from convolutional neural networks to address this problem. Our model reduces the number of features efficiently in a data dependent way. Some empirical experiments are provided to support our model.

2022-01-19

  • The Official NVIDIA Blog

    New NVIDIA AI Enterprise Release Lights Up Data Centers
    (5 min) With a new year underway, NVIDIA is helping enterprises worldwide add modern workloads to their mainstream servers using the latest release of the NVIDIA AI Enterprise software suite. NVIDIA AI Enterprise 1.1 is now generally available. Optimized, certified and supported by NVIDIA, the latest version of the software suite brings new updates including production support […]
    Fusing Art and Tech: MORF Gallery CEO Scott Birnbaum on Digital Paintings, NFTs and More
    (3 min) Browse through MORF Gallery — virtually or at an in-person exhibition — and you’ll find robots that paint, digital dreamscape experiences, and fine art brought to life by visual effects. The gallery showcases cutting-edge, one-of-a-kind artwork from award-winning artists who fuse their creative skills with AI, machine learning, robotics and neuroscience. Scott Birnbaum, CEO and […]
  • MIT News - Artificial intelligence

    When should someone trust an AI assistant’s predictions?
    (7 min) Researchers have created a method to help workers collaborate with artificial intelligence systems.
  • Becoming Human: Artificial Intelligence Magazine - Medium

    Voice-Based Gender Identification using SVM
    (10 min) Let’s find out whether this machine learning model is able to distinguish male and female voices.
    Is AI taking quality and cost optimization of enterprise services to the next level?
    (6 min) Dr Adrian Engelbrecht, Product & Development Lead, Serviceware AI, looks at how AI is taking quality and cost optimization of enterprise services to the next level. Business and service leaders are…
    Mining Associations & Frequent Patterns
    (7 min) Imagine that you are a sales manager at Big Electronics Shop, and you are talking to a customer who recently bought a PC and a digital…
  • John D. Cook

    Upper bound on wait time for general queues
    (1 min) The previous post presented an approximation for the steady-state wait time in a queue with a general probability distribution for inter-arrival times and for service times, i.e. a G/G/s queue where s is the number of servers. This post gives an upper bound for the wait time. Let σ²_A be the variance of the inter-arrival […]
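    The post is truncated above, so the exact bound it derives is not shown. For the single-server case, a classical bound of this kind is Kingman's inequality, sketched here in Python as an assumed stand-in (the G/G/s bound in the post may differ):

        def kingman_upper_bound(arrival_rate, var_interarrival, var_service, rho):
            # Kingman's inequality for a stable G/G/1 queue (rho < 1):
            #   E[Wq] <= lam * (sigma_A**2 + sigma_S**2) / (2 * (1 - rho))
            assert rho < 1, "queue must be stable"
            return arrival_rate * (var_interarrival + var_service) / (2 * (1 - rho))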
  • Microsoft Research

    DeepSpeed: Advancing MoE inference and training to power next-generation AI scale
    (21 min) In the last three years, the largest trained dense models have increased in size by over 1,000 times, from a few hundred million parameters to over 500 billion parameters in Megatron-Turing NLG 530B (MT-NLG). Improvements in model quality with size suggest that this trend will continue, with larger model sizes bringing better model quality. However, […]
  • Artificial Intelligence

    Is there an AI which is able to generate realistic images, no matter from what?
    (0 min) submitted by /u/xXNOdrugsForMEXx
    Innovating Meaningful & Impactful Health System Transformation - Fanny Sie - One Roche Head of Artificial Intelligence and Digital Health, F. Hoffmann La Roche
    (1 min) submitted by /u/ObjectiveGround5
    What an AI thinks the Broken Peace is.
    (1 min) submitted by /u/Crow19852
    Researchers Introduce ‘SeMask’: An Effective Transformer-Framework That Incorporates Semantic Information Into The Encoder With The Help Of A Semantic Attention Operation
    (1 min) After demonstrating the transformer’s efficiency in the visual domain, the research community has focused on extending its use to several fields. One of these is semantic segmentation, a critical application for many areas, such as autonomous driving or medical diagnosis. The classical approach to this topic has been to use an existing pre-trained Transformer Layer as an encoder, tuning it for the segmentation task. However, this approach lacks insight into the semantic context during fine-tuning due to the relatively small dataset size compared to the one used for pre-training. The Picsart AI Research team, together with the SHI Lab and IIT Roorkee, proposed the SeMask framework to incorporate semantic information into a generic hierarchical vision transformer architecture (such as Swin Transformer) through two techniques. First, a Semantic Layer is added after the Transformer Layer; second, two decoders are used: a lightweight semantic decoder used only for training and a feature decoder. A comparison of the general pipeline of existing methods with SeMask is given in the paper summary linked below. Paper Summary: https://www.marktechpost.com/2022/01/19/researchers-introduce-semask-an-effective-transformer-framework-that-incorporates-semantic-information-into-the-encoder-with-the-help-of-a-semantic-attention-operation/ Paper: https://arxiv.org/abs/2112.12782 Github: https://github.com/Picsart-AI-Research/SeMask-Segmentation submitted by /u/techsucker
    How AI Can Solve Customer Behavior Analysis
    (0 min) submitted by /u/Visionifyai
    [R] Less is More: Understanding Neural Network Decisions via Simplified Yet Informative Inputs
    (1 min) A research team from University Medical Center Freiburg, ML Collective, and Google Brain introduces SimpleBits — an information-reduction method that learns to synthesize simplified inputs that contain less information yet remain informative for the task, providing a new approach for exploring the basis of network decisions. Here is a quick read: Less is More: Understanding Neural Network Decisions via Simplified Yet Informative Inputs. The paper When Less is More: Simplifying Inputs Aids Neural Network Understanding is on arXiv. submitted by /u/Yuqing7
    Papers With Code's Coolest AI Publication of 2021 Explained: ADOP - Create Smooth Videos from Images!
    (1 min) submitted by /u/OnlyProggingForFun
    Top 10 Machine Translation APIs
    (0 min) submitted by /u/tah_zem
    What AIs are there which are able to generate images which look realistic?
    (1 min) There are so many cool AIs that can make art or edit images; now I am asking myself whether it is possible for an AI to generate realistic images from anything. Second question: is there a list of all the AIs out there and what they can do? submitted by /u/xXNOdrugsForMEXx
    What AI could I use to turn images from faces into drawings/sketches?
    (1 min) The AI could cost money. submitted by /u/xXLisa28Xx
    https://www.youtube.com/watch?v=mSRrtr497do
    (0 min) submitted by /u/Thenamessd
    CALTECH CMTE AI ML BOOTCAMP
    (0 min) I’m looking into enrolling in this bootcamp: https://pg-p.ctme.caltech.edu/online-ai-machine-learning-bootcamp-live-training. Has anybody taken this bootcamp? Any thoughts? The modules are taught by CALTECH and IBM. submitted by /u/adraxediem
    Alan Watts predicted Neuralink over 60 years ago. (Awesome video)
    (0 min) submitted by /u/Wisdom369
    Artificial Nightmares : Shrek Is Human || Pytti VQGAN AI Art Video [4K 60 FPS]
    (0 min) submitted by /u/Thenamessd
    Artificial General Intelligence Conference in 2022
    (2 min) submitted by /u/akolonin
    Generating Natural Language Training Data from Physics Simulations in Unity
    (0 min) submitted by /u/itsallshit-eatup
    Radiohead - Pyramid Song - AI Video Art
    (0 min) submitted by /u/glenniszen
  • Machine Learning

    [D] Big differences between swarm learning and federated learning
    (1 min) A fairly new 2021 paper in Nature (https://www.nature.com/articles/s41586-021-03583-3) introduces a new framework, called swarm learning, that can deal with decentralized data. The authors say that the biggest innovation of swarm learning is that there is no central node that aggregates the data. However, according to their supplementary material, at the time when they "share the learning from the individual models, one of the swarm nodes is dynamically elected as a leader...the elected leader node collects the model parameters from each peer node and merges them... the nodes use the merged parameters to start the next training batch". In my opinion, although swarm learning does not have an explicit central node, each time the model parameters need to be shared one of the local nodes is chosen as a center, which can be seen as an implicit central node. Besides, I don't see how privacy is preserved in this setting, because one local node will receive all the parameters of the other nodes, not the average of the parameters as in the standard federated learning setting. Also, the authors use a blockchain as the network and say that blockchain smart contracts prevent attacks from semi-honest or dishonest participants. I am not very familiar with blockchains, so I am a little confused about how this works in detail and what the advantage of the blockchain is. submitted by /u/mike_shen
    [D] How to train a NeRF in seconds explained - Instant Neural Graphics Primitives with a Multiresolution Hash Encoding - 5-minute paper summary (by Casual GAN Papers)
    (1 min) If you liked the 100x NeRF speed-up from a month ago, you will love this fresh new way to train NeRF 1000x faster, proposed in a paper by Thomas Müller and the team at Nvidia. It uses a custom data structure for input encoding, implemented as CUDA kernels highly optimized for modern GPUs. Specifically, the authors propose learning a multiresolution hash table that maps query coordinates to feature vectors. The encoded input feature vectors are passed through a small MLP to predict the color and density of a point in the scene, NeRF-style. How does this help the model fit entire scenes in seconds? Let’s learn! Full summary: https://t.me/casual_gan/239 Blog post: https://www.casualganpapers.com/fastest_nerf_3d_neural_rendering/Instant-Neural-Graphics-Primitives-explained.html Instant NeRF arxiv / code Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries! submitted by /u/KirillTheMunchKing
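    As a rough feel for the encoding, here is a numpy sketch of a single hash-grid level, assuming 2D inputs and a prime-multiply-and-XOR spatial hash in the spirit of the paper (table sizes and resolutions here are illustrative, and the real implementation runs as fused CUDA kernels with learned tables):
        import numpy as np

        PRIMES = np.array([1, 2654435761], dtype=np.uint64)  # per-dimension hash primes

        def hash_grid_level(coords, table, resolution):
            # coords: (N, 2) points in [0, 1]^2; table: (T, F) feature rows.
            x = coords * resolution
            x0 = np.floor(x).astype(np.uint64)
            w = x - x0  # bilinear interpolation weights
            feats = 0.0
            for dx in (0, 1):
                for dy in (0, 1):
                    corner = x0 + np.array([dx, dy], dtype=np.uint64)
                    # Spatial hash: XOR of per-dimension prime products, mod table size.
                    idx = (corner[:, 0] * PRIMES[0]) ^ (corner[:, 1] * PRIMES[1])
                    idx = (idx % np.uint64(len(table))).astype(np.int64)
                    wx = w[:, 0] if dx else 1.0 - w[:, 0]
                    wy = w[:, 1] if dy else 1.0 - w[:, 1]
                    feats = feats + (wx * wy)[:, None] * table[idx]
            return feats

        # Concatenate several resolutions, then feed the result to a small MLP.
        tables = [np.random.randn(2**14, 2).astype(np.float32) for _ in range(4)]
        points = np.random.rand(8, 2)
        enc = np.concatenate([hash_grid_level(points, t, r)
                              for t, r in zip(tables, (16, 32, 64, 128))], axis=1)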
    [R] AutoInit Software for Model Initialization
    (1 min) AutoInit is a weight initialization method that automatically adapts to different neural network architectures. It tracks the mean and variance of signals as they propagate through the network and initializes the weights at each layer to avoid exploding or vanishing signals. AutoInit can be used to improve performance of feedforward, convolutional, and residual networks; configured with different activation function, dropout, weight decay, learning rate, and normalizer settings; and applied to vision, language, tabular, multi-task, and transfer learning domains. The software package provides a simple wrapper that makes it possible to apply AutoInit to existing TensorFlow models as-is. We invite you to try it out and see if it can improve the performance of your neural network models! For further details, see - GitHub repo: https://github.com/cognizant-ai-labs/autoinit - arXiv paper: https://arxiv.org/abs/2109.08958 AutoInit is also available through the Cognizant AI Labs Software page, together with related software on estimating model uncertainty, multitasking, loss-function metalearning, decision making, and model management, at - https://evolution.ml/software -- Garrett Bingham & Risto Miikkulainen submitted by /u/thedeepevolution
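    For a feel of what signal-tracking initialization does, here is a minimal sketch in the same spirit (this is not the AutoInit API; it rescales each dense layer's weights, LSUV-style, so a probe batch keeps roughly unit variance layer by layer):
        import numpy as np

        def variance_scaled_init(weights, act, probe):
            # Push a probe batch through the net layer by layer; rescale each
            # weight matrix so its pre-activations keep roughly unit variance,
            # avoiding exploding/vanishing signals at the start of training.
            x = probe
            for W in weights:
                pre = x @ W
                W *= np.sqrt(1.0 / (pre.var() + 1e-8))  # in-place rescale
                x = act(x @ W)
            return weights

        relu = lambda z: np.maximum(z, 0.0)
        rng = np.random.default_rng(0)
        weights = [rng.normal(size=(64, 64)) for _ in range(10)]
        weights = variance_scaled_init(weights, relu, rng.normal(size=(128, 64)))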
    [D] ICML abstract deadline vs ICLR results date
    (1 min) This year, the ICML abstract deadline (Jan 20th) is before ICLR results come out (Jan 24th). Does anyone know if this was done purposefully to prevent resubmissions of rejected papers from ICLR? FWIW I don't think it's a bad idea, just curious if there was some public discussion about this. For context, last year the ICML abstract deadline was 2 weeks after ICLR results came out. submitted by /u/lit_turtle_man
    [R] Does anyone have any experience renting a whole bare-metal Nvidia A100? Who did you go through? What was your experience like?
    (1 min) Just like the title says, looking to hear from anyone who has rented a bare metal A100 before and wondering how it went and why you decided to go with bare metal - thanks! submitted by /u/After-Surfree
    [R] Accelerating ML-based analytics queries with indexes
    (1 min) Running queries over unstructured data? Our new work on indexes, TASTI, can accelerate your queries by up to 20x! We describe the work in part 5 of our blog post series on accelerating queries over unstructured data: https://dawn.cs.stanford.edu/2022/01/18/tasti/ (full paper, accepted to SIGMOD 2022: https://arxiv.org/abs/2009.04540). TASTI works by leveraging redundancy in datasets: it labels a small amount of data and learns distances from unlabeled data points to the labeled ones. It can then generate proxy scores per query without training a new model! submitted by /u/danieldkang
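    A toy sketch of the proxy-score idea as described above: give each unlabeled record the score of its nearest labeled neighbour in an embedding space (the actual index is considerably more sophisticated):
        import numpy as np

        def proxy_scores(embeddings, labeled_idx, labeled_scores):
            # Exploit redundancy: a few expensive labels, propagated by
            # nearest-neighbour distance, stand in for per-record inference.
            d = np.linalg.norm(
                embeddings[:, None, :] - embeddings[labeled_idx][None, :, :], axis=-1)
            return labeled_scores[d.argmin(axis=1)]

        emb = np.random.rand(1000, 16)              # precomputed per-record embeddings
        lab = np.random.choice(1000, 20, replace=False)
        scores = np.random.rand(20)                 # expensive labels on a small subset
        proxies = proxy_scores(emb, lab, scores)    # cheap scores for everything else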
    [D] Did you also feel that Snorkel's LabelModel is really slow?
    (0 min) Has anyone here used Snorkel AI's LabelModel for automatically labeling text? Have you found it to be super slow? submitted by /u/freaky_eater
    [D] Predict video frames using Generative Adversarial Networks (GANs)
    (1 min) I have a project where it is desirable to predict a time sequence of images (like frames of a video) as the output of a neural network, starting from another time sequence of different images. From what I understand, I cannot use a VAE, since the input images and output images represent different things (like cats and dogs). I was investigating the use of a GAN for this purpose. However, I have never implemented something like this before (and I read that training GANs can be very painful and unstable). Checking some prior work, I have seen this is possible in practice when starting from past frames as inputs. However, that is not my case. Do you have any suggestions on how to break down the problem? submitted by /u/Betelgeuse999
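    One common way to break this down is to treat each output frame as conditional image-to-image translation (pix2pix-style): the generator maps the input sequence to a frame, a conditional discriminator judges (input, frame) pairs, and an L1 reconstruction term anchors training. A hedged PyTorch sketch of the two objectives, with the model definitions left to you:
        import torch
        import torch.nn.functional as F

        def g_loss(D, x_in, y_fake, y_real, l1_weight=100.0):
            # Adversarial term: make the conditional discriminator call the
            # generated frame real; L1 term: stay close to the ground truth,
            # which substantially stabilizes conditional GAN training.
            logits = D(x_in, y_fake)
            adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
            return adv + l1_weight * F.l1_loss(y_fake, y_real)

        def d_loss(D, x_in, y_fake, y_real):
            real_logits = D(x_in, y_real)
            fake_logits = D(x_in, y_fake.detach())  # don't backprop into G here
            real = F.binary_cross_entropy_with_logits(
                real_logits, torch.ones_like(real_logits))
            fake = F.binary_cross_entropy_with_logits(
                fake_logits, torch.zeros_like(fake_logits))
            return 0.5 * (real + fake)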
    [D] First Author Interview - Noether Networks: Meta-Learning Useful Conserved Quantities (Paper Explained Video & Interview)
    (1 min) https://youtu.be/Xp3jR-ttMfo This video includes an interview with first author Ferran Alet! Encoding inductive biases is a long-established method for letting deep networks learn from less data. Especially useful are encodings of symmetry properties of the data, such as the convolution's translation invariance. But such symmetries are often hard to program explicitly, and can only be enforced exactly when encoded directly. Noether Networks use Noether's theorem connecting symmetries to conserved quantities, and are able to dynamically and approximately enforce symmetry properties on deep neural networks. OUTLINE: 0:00 - Intro & Overview 18:10 - Interview Start 21:20 - Symmetry priors vs conserved quantities 23:25 - Example: Pendulum 27:45 - Noether Network Model Overview 35:35 - Optimizing the Noether Loss 41:00 - Is the computation graph stable? 46:30 - Increasing the inference time computation 48:45 - Why dynamically modify the model? 55:30 - Experimental Results & Discussion Paper: https://arxiv.org/abs/2112.03321 Website: https://dylandoblar.github.io/noether-networks/ Code: https://github.com/dylandoblar/noether-networks submitted by /u/ykilcher
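    A simplified sketch of the core mechanism, assuming a learned embedding g of states whose values should stay constant along a rollout (the paper tailors a per-example copy of the predictor; this sketch just updates the model in place):
        import torch

        def noether_loss(g, rollout):
            # rollout: predicted states x_0..x_T; g embeds each state into
            # quantities that a valid trajectory should conserve.
            q0 = g(rollout[0])
            return sum(((g(x) - q0) ** 2).mean() for x in rollout[1:])

        def tailor(model, g, x0, steps=5, lr=1e-3, horizon=10):
            # Inference-time adaptation: nudge the predictor so that its own
            # rollout conserves g, approximately enforcing the learned symmetry.
            opt = torch.optim.SGD(model.parameters(), lr=lr)
            for _ in range(steps):
                xs, x = [x0], x0
                for _ in range(horizon):
                    x = model(x)
                    xs.append(x)
                loss = noether_loss(g, xs)
                opt.zero_grad()
                loss.backward()
                opt.step()
            return model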
    [D] Generalization in time-series data and the underlying assumptions of ML models.
    (1 min) I've been refreshing some of my foundational knowledge of statistics and generalization. Basically, all "classical" statistical models assume that the data the model will be predicting comes from the same distribution as the data the model was trained on. This in turn gives us concrete figures on the generalization of our model. But what about time series? Time-series datasets are inherently not drawn from a fixed distribution. So I suppose this means I can't have figures on generalization? And what about applying models that assume a certain distribution for the training dataset (like linear regression)? submitted by /u/bloodmummy
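    In practice, the usual workaround is to measure generalization empirically with walk-forward (rolling-origin) evaluation rather than relying on an i.i.d. bound. A small sketch, with fit and predict as stand-ins for any forecaster:
        import numpy as np

        def rolling_origin_errors(series, fit, predict, min_train=100, horizon=10):
            # Walk-forward evaluation: always train on the past, test on the
            # immediate future, so the score reflects the distribution shift
            # a deployed forecaster actually faces.
            errs = []
            for end in range(min_train, len(series) - horizon, horizon):
                model = fit(series[:end])
                forecast = predict(model, horizon)
                errs.append(np.mean(np.abs(forecast - series[end:end + horizon])))
            return np.array(errs)  # if errors trend upward, the series is drifting

        # Toy usage with a "last-window mean" forecaster:
        y = np.cumsum(np.random.randn(500))
        scores = rolling_origin_errors(y, fit=lambda h: h[-20:].mean(),
                                       predict=lambda m, h: np.full(h, m))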
    [P] Can't finish my master's thesis. What to do?
    (10 min) I am not sure if this is the place for this post, but here it goes. I am in the last semester of my MA degree in AI. I am still working on some courses, but I will probably have no problem finishing them, so I'll be left with only the thesis to work on. I have less than one month to finish, but it is possible to ask for a 3-month extension. An important detail is that I am doing the thesis in cooperation with a startup. Because of that, my 'real' advisor is my boss at this startup; I have exchanged only a couple of emails with my advisor from the university. The problem is that I am stuck in my project. It is about generating synthetic data for training neural networks that recognize the numbers on the backs of soccer players. But I feel that I am wasting time on a proposal that will never work, be…
    [D] Where to find current research on rating systems
    (1 min) Hello /r/MachineLearning, I have a strong interest in learning more about the state of the art for rating systems. I know about simple ones such as Elo and Bradley-Terry models, some more advanced ones like Glicko, TrueSkill, WHR, and Melo, and some types of learning-to-rank models that can be applied to rating. I was wondering whether there is standard terminology for this topic, or any central venue where research in this area is published, like a particular conference or journal. What I have found so far appears in stats journals (usually under "paired comparisons"), at ML conferences like NeurIPS, and on arXiv. My best strategy for finding new work in this area is to search Google Scholar for the most recent papers citing TrueSkill or Glicko. Does anyone here know a good way to find the most recent research in this niche area? Thanks! submitted by /u/cthorrez
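    For readers new to the area, the Elo update that most of these systems generalize fits in a few lines:
        def elo_update(r_a, r_b, score_a, k=32):
            # score_a is 1 for an A win, 0.5 for a draw, 0 for a loss.
            expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
            delta = k * (score_a - expected_a)
            return r_a + delta, r_b - delta

        # A 1500-rated player upsetting a 1700-rated one gains about 24 points:
        print(elo_update(1500, 1700, 1))
    Glicko and TrueSkill extend this by tracking uncertainty around each rating rather than a point estimate.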
    [P] I made a software package for the N-Beats neural network architecture in keras
    (1 min) Hi everyone! I work in time series and forecasting a lot at work, and recently developed a user-friendly version of the N-Beats architecture with keras, since I didn't think there were good choices here. You can see the documentation here: https://kerasbeats.readthedocs.io/en/latest/ The package provides a high level wrapper over a keras model, but also gives you the ability to work directly with the model layer if you so please. I don't know that this is going to set the world of DL on fire, but I've been using it at work already with good results, and it provides an easy way to configure the model architecture defined in the paper here: https://arxiv.org/abs/1905.10437. submitted by /u/jonathanbechtel
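    For readers unfamiliar with the paper, a bare-bones sketch of the doubly residual stacking that defines N-Beats (the real blocks are MLPs that output basis-expansion coefficients; the toy block here just projects the window onto a fixed polynomial basis):
        import numpy as np

        def nbeats_forward(blocks, window):
            # Doubly residual stacking: each block explains part of its input
            # (backcast) and emits a partial forecast; the unexplained residual
            # feeds the next block, and partial forecasts are summed.
            residual, forecast = window, 0.0
            for block in blocks:
                backcast, partial = block(residual)
                residual = residual - backcast
                forecast = forecast + partial
            return forecast

        def make_block(L, H, degree):
            # Toy block: least-squares fit of the window onto a polynomial basis.
            t_in = np.linspace(0, 1, L)[:, None] ** np.arange(degree + 1)
            t_out = np.linspace(1, 1 + H / L, H)[:, None] ** np.arange(degree + 1)
            def block(x):
                theta = np.linalg.lstsq(t_in, x, rcond=None)[0]
                return t_in @ theta, t_out @ theta
            return block

        blocks = [make_block(L=40, H=10, degree=d) for d in (1, 2, 3)]
        forecast = nbeats_forward(blocks, np.sin(np.linspace(0, 6, 40)))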
    [Discussion] Good software to review annotations from a CSV file
    (1 min) Hi! I am annotating pictures with VGG VIA, which works great, but it is difficult to double-check my annotations. Other tools have colour codes etc. to make this easier; VIA does not. Is there annotation software into which I could upload my work (as CSV) from VIA to do the review more easily? (I have 18,000 pictures to review.) Thanks! submitted by /u/yellowgorilla556
  • Neural Networks, Deep Learning and Machine Learning

    Instant Neural Graphics Primitives with a Multiresolution Hash Encoding
    (0 min) submitted by /u/nickb
  • Reinforcement Learning

    When should discretization of observations be considered?
    (1 min) I have seen papers that discuss what shape the action space should have in various environments. However, I haven't seen much regarding observations. I'm not talking about normalisation of observations, since NNs are mostly used as function approximators in reinforcement learning. I am curious about this because a policy is basically a mapping from states to actions, and discretizing the observations leads to a smaller mapping that has to be learned. Of course this could lower peak performance, but I can also imagine higher training speed. Or is my understanding wrong? Does anyone have a reference on this topic? submitted by /u/kitaird
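    For experimenting with this, a small gym observation wrapper that bins a bounded Box space is enough (a sketch, assuming finite low/high bounds on every dimension):
        import gym
        import numpy as np

        class DiscretizeObservation(gym.ObservationWrapper):
            # Bins each observation dimension, shrinking the set of distinct
            # states the policy must map; coarser bins can speed up learning
            # at the cost of peak performance.
            def __init__(self, env, n_bins=10):
                super().__init__(env)
                self.low = env.observation_space.low
                self.high = env.observation_space.high
                self.n_bins = n_bins

            def observation(self, obs):
                frac = (obs - self.low) / (self.high - self.low)
                binned = np.floor(frac * self.n_bins).clip(0, self.n_bins - 1)
                # Return bin centres so the scale matches the original space.
                return self.low + (binned + 0.5) / self.n_bins * (self.high - self.low)

        env = DiscretizeObservation(gym.make("Pendulum-v1"), n_bins=9)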
    Soft Actor-Critic: termination step
    (1 min) Hi all! I'm implementing the SAC algorithm for a custom environment. The agent controls a creature that is not supposed to die or reach any final step, and this affects how I train the Q-networks: there is no state where the target is just a reward without a transition: y = r + γ*(1 - done)*(min(q1Targ(s', a'), q2Targ(s', a')) - α*log π(a'|s')), where "done" is always 0. Will my policy converge in this case? submitted by /u/CharlottHebb
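    For what it's worth, with γ < 1 the soft Bellman backup remains a contraction even when no state is terminal, so an always-zero done flag is not by itself a convergence problem; the target simply always bootstraps. A PyTorch sketch of that continuing-task target, with q1_targ, q2_targ and policy as stand-ins for your networks:
        import torch

        def sac_target(reward, next_obs, q1_targ, q2_targ, policy, alpha, gamma=0.99):
            # Continuing-task form of the SAC target: no terminal states, so
            # the (1 - done) mask disappears and we always bootstrap.
            a_next, logp_next = policy(next_obs)  # sample a' and log pi(a'|s')
            q_next = torch.min(q1_targ(next_obs, a_next), q2_targ(next_obs, a_next))
            return reward + gamma * (q_next - alpha * logp_next)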
    How Amazon Ads uses reinforcement learning
    (0 min) submitted by /u/ordinarytime
    [newbie] How to make a reward function to solve a toy bit flipping problem?
    (2 min) I am new to RL and am trying the following toy problem using stable baselines 3. We start with a random array of bits (that is, values that are either 1 or 0). The only actions are to flip one of the bits, and the goal is to set them all to 1. If I make the input list of length 10 then it works fine using the following simple modification of the tutorial code:
        import numpy as np
        import gym
        from gym import spaces
        import random

        class BitFlipEnv(gym.Env):
            """
            Custom Environment that follows gym interface.
            This is a simple env where the agent must learn to go always left.
            """
            metadata = {'render.modes': ['console']}

            def __init__(self, array_length=10):
                super(BitFlipEnv, self).__init__()
                # Size of the 1D-grid
                self.array_length = array_length
                self.agent_pos = random.choic…
    [D] DDPG not converging where actor critic and DQN are - any advice?
    (2 min) Hi all, I am following the Keras DDPG workflow (https://keras.io/examples/rl/ddpg_pendulum/) for a custom environment I have built: a graph optimisation problem in which the removal or addition of a connection between nodes (as done by the agent) results in a new state. The reward is based on how beneficial the node removal was, as measured by a metric. The state is represented either by a node embedding (from a node2vec representation of that node at each step) or by a CNN/GNN encoding of the new graph at each step (I am experimenting with both). I used a smaller version of the graph (100 nodes) as a test for the exercise; both DQN and actor-critic methods converge quickly. To scale up to a larger graph, I changed this to a continuous task and chose the action as two dials with a range of 0 to n nodes (each dial selecting a node, after which the connection between those nodes is changed). I am still doing this on the smaller graph just to see that it works. Unfortunately, there is no convergence, even though the agent has seen positive rewards several times. While vague, I was wondering if there are any glaringly obvious reasons why this might be happening and what should be tweaked or changed. A couple of observations: the agent goes to the extrema of action selection practically immediately (-1, 1) (note tanh is untweaked, as I simply scale up the actions inside the environment, but it still returns actions in the range -1 to 1); the replay buffer contains actions in the -1 to 1 range, and rewards are in the range -1 to 1. One thing I am unsure about is how much standard deviation I should use for the noise. Any advice is much appreciated; I am happy to share snippets of code if anyone thinks particular parts might be relevant. Thank you! submitted by /u/amjass12
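    On the noise question specifically, a common rule of thumb is a Gaussian (or Ornstein-Uhlenbeck) sigma around 0.1-0.2 of the action range, annealed over training; immediately saturating at the tanh extremes often points to a noise scale or critic gradients that are too large. A hedged sketch:
        import numpy as np

        class GaussianActionNoise:
            # Start sigma at ~0.1-0.2 of the action range and anneal it, so
            # exploration doesn't pin actions at the extremes indefinitely.
            def __init__(self, act_dim, sigma=0.2, sigma_min=0.05, decay=1e-5):
                self.act_dim, self.sigma = act_dim, sigma
                self.sigma_min, self.decay = sigma_min, decay

            def __call__(self, action):
                noisy = action + np.random.normal(0.0, self.sigma, self.act_dim)
                self.sigma = max(self.sigma_min, self.sigma - self.decay)
                return np.clip(noisy, -1.0, 1.0)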

2022-01-18

  • cs.LG updates on arXiv.org

    Artificial Intelligence in Software Testing : Impact, Problems, Challenges and Prospect. (arXiv:2201.05371v1 [cs.SE])
    (2 min) Artificial Intelligence (AI) is making a significant impact in multiple areas such as medicine, the military, industry, domestic applications, law and the arts, as AI is capable of performing several roles such as managing smart factories, driving autonomous vehicles, creating accurate weather forecasts, detecting cancer, and acting as a personal assistant. Software testing is the process of exercising software to check for abnormal behaviour. It is a tedious, laborious and time-consuming process. Automation tools have been developed that help to automate some activities of the testing process to enhance quality and timely delivery. Over time, with the inclusion of continuous integration and continuous delivery (CI/CD) pipelines, automation tools are becoming less effective. The testing community is turning to AI to fill the gap, as AI is able to check the code for bugs and errors without any human intervention and much faster than humans. In this study, we aim to recognize the impact of AI technologies on various software testing activities and facets of the software testing life cycle (STLC). Further, the study aims to recognize and explain some of the biggest challenges software testers face while applying AI to testing. The paper also proposes some key future contributions of AI to the domain of software testing.
    Machine Learning of polymer types from the spectral signature of Raman spectroscopy microplastics data. (arXiv:2201.05445v1 [cs.LG])
    (2 min) The tools and technology currently used to analyze the chemical compound structures that identify polymer types in microplastics are not well-calibrated for environmentally weathered microplastics. Microplastics that have been degraded by environmental weathering factors offer less analytic certainty than samples that have not been exposed to weathering processes. Machine learning tools and techniques allow us to better calibrate the research tools for certainty in microplastics analysis. In this paper, we investigate whether the signatures (Raman shift values) are distinct enough that well-studied machine learning (ML) algorithms can learn to identify polymer types using a relatively small amount of labeled input data when the samples have not been impacted by environmental degradation. Several ML models were trained on a well-known repository, Spectral Libraries of Plastic Particles (SLOPP), which contains Raman shift and intensity results for a range of plastic particles, then tested on environmentally aged plastic particles (SloPP-E) consisting of 22 polymer types. After extensive preprocessing and augmentation, the trained random forest model was tested on the SloPP-E dataset, improving classification accuracy from 89% to 93.81%.
    Decompositional Quantum Graph Neural Network. (arXiv:2201.05158v1 [quant-ph])
    (2 min) Quantum machine learning is a fast-emerging field that aims to tackle machine learning using quantum algorithms and quantum computing. Due to the lack of physical qubits and an effective means to map real-world data from Euclidean space to Hilbert space, most of these methods focus on quantum analogies or process simulations rather than devising concrete architectures based on qubits. In this paper, we propose a novel hybrid quantum-classical algorithm for graph-structured data, which we refer to as the Decompositional Quantum Graph Neural Network (DQGNN). DQGNN implements the GNN theoretical framework using a tensor-product and unity-matrix representation, which greatly reduces the number of model parameters required. When controlled by a classical computer, DQGNN can accommodate arbitrarily sized graphs by processing substructures from the input graph on a modestly sized quantum device. The architecture is based on a novel mapping from real-world data to Hilbert space that maintains the distance relations present in the data and reduces information loss. Experimental results show that the proposed method outperforms competitive state-of-the-art models with only 1.68% of their parameters.
    Making a (Counterfactual) Difference One Rationale at a Time. (arXiv:2201.05177v1 [cs.CL])
    (2 min) Rationales, snippets of extracted text that explain an inference, have emerged as a popular framework for interpretable natural language processing (NLP). Rationale models typically consist of two cooperating modules, a selector and a classifier, with the goal of maximizing the mutual information (MMI) between the "selected" text and the document label. Despite their promise, MMI-based methods often pick up on spurious text patterns and result in models with nonsensical behaviors. In this work, we investigate whether counterfactual data augmentation (CDA), without human assistance, can improve the performance of the selector by lowering the mutual information between spurious signals and the document label. Our counterfactuals are produced in an unsupervised fashion using class-dependent generative models. From an information-theoretic lens, we derive properties of the unaugmented dataset for which our CDA approach would succeed. The effectiveness of CDA is empirically evaluated by comparing against several baselines, including an improved MMI-based rationale schema, on two multi-aspect datasets. Our results show that CDA produces rationales that better capture the signal of interest.
    A Deep Generative Model for Molecule Optimization via One Fragment Modification. (arXiv:2012.04231v5 [cs.LG] UPDATED)
    (2 min) Molecule optimization is a critical step in drug development, improving the desired properties of drug candidates through chemical modification. We developed Modof, a novel deep generative model over molecular graphs for molecule optimization. Modof modifies a given molecule by predicting a single site of disconnection and removing and/or adding fragments at that site. A pipeline of multiple identical Modof models, Modof-pipe, modifies an input molecule at multiple disconnection sites. Here we show that Modof-pipe is able to retain major molecular scaffolds, allow control over intermediate optimization steps and better constrain molecule similarity. Modof-pipe outperforms the state-of-the-art methods on benchmark datasets: without molecular similarity constraints, Modof-pipe achieves 81.2% improvement in octanol-water partition coefficient penalized by synthetic accessibility and ring size, and 51.2%, 25.6% and 9.2% improvement if the optimized molecules are at least 0.2, 0.4 and 0.6 similar to those before optimization, respectively. Modof-pipe is further enhanced into Modof-pipem, which modifies one molecule into multiple optimized ones. Modof-pipem achieves additional performance improvement, at least 17.8% over Modof-pipe.
    Drop Clause: Enhancing Performance, Interpretability and Robustness of the Tsetlin Machine. (arXiv:2105.14506v2 [cs.LG] UPDATED)
    (2 min) In this article, we introduce a novel variant of the Tsetlin machine (TM) that randomly drops clauses, the key learning elements of a TM. In effect, TM with drop clause ignores a random selection of the clauses in each epoch, selected according to a predefined probability. In this way, additional stochasticity is introduced in the learning phase of TM. To explore the effects drop clause has on accuracy, training time, interpretability and robustness, we conduct extensive experiments on nine benchmark datasets in natural language processing (NLP) (IMDb, R8, R52, MR and TREC) and image classification (MNIST, Fashion MNIST, CIFAR-10 and CIFAR-100). Our proposed model outperforms baseline machine learning algorithms by a wide margin and achieves competitive performance in comparison with recent deep learning models such as BERT and AlexNET-DFA. In brief, we observe up to a +10% increase in accuracy and 2x to 4x faster learning compared with the standard TM. We further employ the Convolutional TM to document interpretable results on the CIFAR datasets, visualizing how the heatmaps produced by the TM become more interpretable with drop clause. We also evaluate how drop clause affects learning robustness by introducing corruptions and alterations in the image/language test data. Our results show that drop clause makes TM more robust to such changes.
    Understanding Bird's-Eye View of Road Semantics using an Onboard Camera. (arXiv:2012.03040v2 [cs.CV] UPDATED)
    (2 min) Autonomous navigation requires scene understanding of the action space to move or anticipate events. For planner agents moving on the ground plane, such as autonomous vehicles, this translates to scene understanding in the bird's-eye view (BEV). However, the onboard cameras of autonomous cars are customarily mounted horizontally for a better view of the surroundings. In this work, we study scene understanding in the form of online estimation of semantic BEV maps using the video input from a single onboard camera. We study three key aspects of this task: image-level understanding, BEV-level understanding, and the aggregation of temporal information. Based on these three pillars we propose a novel architecture that combines all three aspects. In our extensive experiments, we demonstrate that the considered aspects are complementary to each other for BEV understanding. Furthermore, the proposed architecture significantly surpasses the current state of the art. Code: https://github.com/ybarancan/BEV_feat_stitch.
    Contextual Bandits for Advertising Campaigns: A Diffusion-Model Independent Approach (Extended Version). (arXiv:2201.05231v1 [cs.LG])
    (2 min) Motivated by scenarios of information diffusion and advertising in social media, we study an influence maximization problem in which little is assumed to be known about the diffusion network or about the model that determines how information may propagate. In such a highly uncertain environment, one can focus on multi-round diffusion campaigns, with the objective to maximize the number of distinct users that are influenced or activated, starting from a known base of few influential nodes. During a campaign, spread seeds are selected sequentially at consecutive rounds, and feedback is collected in the form of the activated nodes at each round. A round's impact (reward) is then quantified as the number of newly activated nodes. Overall, one must maximize the campaign's total spread, as the sum of rounds' rewards. In this setting, an explore-exploit approach could be used to learn the key underlying diffusion parameters, while running the campaign. We describe and compare two methods of contextual multi-armed bandits, with upper-confidence bounds on the remaining potential of influencers, one using a generalized linear model and the Good-Turing estimator for remaining potential (GLM-GT-UCB), and another one that directly adapts the LinUCB algorithm to our setting (LogNorm-LinUCB). We show that they outperform baseline methods using state-of-the-art ideas, on synthetic and real-world data, while at the same time exhibiting different and complementary behavior, depending on the scenarios in which they are deployed.
    Understanding Negative Samples in Instance Discriminative Self-supervised Representation Learning. (arXiv:2102.06866v5 [cs.LG] UPDATED)
    (2 min) Instance discriminative self-supervised representation learning has attracted attention thanks to its unsupervised nature and informative feature representations for downstream tasks. In practice, it commonly uses a larger number of negative samples than the number of supervised classes. However, there is an inconsistency in the existing analysis: theoretically, a large number of negative samples degrade classification performance on a downstream supervised task, while empirically they improve the performance. We provide a novel framework to analyze this empirical result regarding negative samples using the coupon collector's problem. Our bound can implicitly incorporate the supervised loss of the downstream task in the self-supervised loss by increasing the number of negative samples. We confirm that our proposed analysis holds on real-world benchmark datasets.
    Proceedings of the 4th Workshop on Online Recommender Systems and User Modeling -- ORSUM 2021. (arXiv:2201.05156v1 [cs.IR])
    (2 min) Modern online services continuously generate data at very fast rates. This continuous flow of data encompasses content -- e.g., posts, news, products, comments --, but also user feedback -- e.g., ratings, views, reads, clicks --, together with context data -- user device, spatial or temporal data, user task or activity, weather. This can be overwhelming for systems and algorithms designed to train in batches, given the continuous and potentially fast change of content, context and user preferences or intents. Therefore, it is important to investigate online methods able to transparently adapt to the inherent dynamics of online services. Incremental models that learn from data streams are gaining attention in the recommender systems community, given their natural ability to deal with the continuous flows of data generated in dynamic, complex environments. User modeling and personalization can particularly benefit from algorithms capable of maintaining models incrementally and online. The objective of this workshop is to foster contributions and bring together a growing community of researchers and practitioners interested in online, adaptive approaches to user modeling, recommendation and personalization, and their implications regarding multiple dimensions, such as evaluation, reproducibility, privacy and explainability.
    Multi-Domain Active Learning: A Comparative Study. (arXiv:2106.13516v4 [cs.LG] UPDATED)
    (2 min) Multi-domain learning (MDL) refers to learning a set of models simultaneously, with each one specialized to perform a task in a certain domain. Generally, high labeling effort is required in MDL, as data needs to be labeled by human experts for every domain. To address the above issue, Active learning (AL) can be utilized to reduce the labeling effort by only using the most informative data. The resultant paradigm is termed multi-domain active learning (MDAL). However, despite the practical significance of MDAL, there exists little research on it, not to mention any off-the-shelf solution. To fill this gap, we construct a simple pipeline of MDAL, and present a comprehensive comparative study of 30 different MDAL algorithms, which are established by combining 6 representative MDL models (equipped with various information-sharing schemes) and 5 well-used AL strategies. We evaluate the algorithms on 6 datasets, involving textual and visual classification tasks. In most cases, AL brings notable improvements to MDL, and surprisingly, the naive best vs second best (BvSB) uncertainty strategy could perform competitively to the state-of-the-art AL strategies. Besides, among the MDL models, MAN and SDL-joint achieve the top performance when applied to vector features and raw images, respectively. Furthermore, we qualitatively analyze the behaviors of these strategies and models, shedding light on their superior performance in the comparison. Overall, some guidelines are provided, which could help to choose MDL models and AL strategies for particular applications.
    Deep Learning for Agile Effort Estimation: Have We Solved the Problem Yet?. (arXiv:2201.05401v1 [cs.SE])
    (2 min) In the last decade, several studies have proposed the use of automated techniques to estimate the effort of agile software development. In this paper we perform a close replication and extension of a seminal work proposing the use of Deep Learning for agile effort estimation (namely Deep-SE), which has set the state-of-the-art since. Specifically, we replicate three of the original research questions aiming at investigating the effectiveness of Deep-SE for both within-project and cross-project effort estimation. We benchmark Deep-SE against three baseline techniques (i.e., Random, Mean and Median effort prediction) and a previously proposed method to estimate agile software project development effort (dubbed TF/IDF-SE), as done in the original study. To this end, we use both the data from the original study and a new larger dataset of 31,960 issues, which we mined from 29 open-source projects. Using more data allows us to strengthen our confidence in the results and further mitigate the threat to the external validity of the study. We also extend the original study by investigating two additional research questions. One evaluates the accuracy of Deep-SE when the training set is augmented with issues from all other projects available in the repository at the time of estimation, and the other examines whether an expensive pre-training step used by the original Deep-SE has any beneficial effect on its accuracy and convergence speed. The results of our replication show that Deep-SE outperforms the Median baseline estimator and TF/IDF-SE in only very few cases with statistical significance (8/42 and 9/32 cases, respectively), thus confounding previous findings on the efficacy of Deep-SE. The two additional RQs revealed that neither augmenting the training set nor pre-training Deep-SE plays a role in improving its accuracy and convergence speed. ...
    CommonsenseQA 2.0: Exposing the Limits of AI through Gamification. (arXiv:2201.05320v1 [cs.CL])
    (2 min) Constructing benchmarks that test the abilities of modern natural language understanding models is difficult: pre-trained language models exploit artifacts in benchmarks to achieve human parity, but still fail on adversarial examples and make errors that demonstrate a lack of common sense. In this work, we propose gamification as a framework for data construction. The goal of players in the game is to compose questions that mislead a rival AI while using specific phrases for extra points. The game environment leads to enhanced user engagement and simultaneously gives the game designer control over the collected data, allowing us to collect high-quality data at scale. Using our method we create CommonsenseQA 2.0, which includes 14,343 yes/no questions, and demonstrate its difficulty for models that are orders of magnitude larger than the AI used in the game itself. Our best baseline, the T5-based Unicorn with 11B parameters, achieves an accuracy of 70.2%, substantially higher than GPT-3 (52.9%) in a few-shot inference setup. Both score well below human performance, which is 94.1%.
    Data Science Methodologies: Current Challenges and Future Approaches. (arXiv:2106.07287v2 [cs.LG] UPDATED)
    (2 min) Data science has devoted great research effort to developing advanced analytics, improving data models and cultivating new algorithms. However, not many authors have addressed the organizational and socio-technical challenges that arise when executing a data science project: lack of vision and clear objectives, a biased emphasis on technical issues, a low level of maturity for ad-hoc projects, and the ambiguity of roles in data science are among these challenges. Few methodologies have been proposed in the literature to tackle these types of challenges; some of them date back to the mid-1990s, and consequently are not updated to the current paradigm and the latest developments in big data and machine learning technologies. In addition, fewer methodologies offer a complete guideline across team, project and data & information management. In this article we explore the necessity of developing a more holistic approach to carrying out data science projects. We first review methodologies that have been presented in the literature for working on data science projects and classify them according to their focus: project, team, data and information management. Finally, we propose a conceptual framework containing the general characteristics that a methodology for managing data science projects with a holistic point of view should have. This framework can be used by other researchers as a roadmap for the design of new data science methodologies or the updating of existing ones.
    Towards Principled Disentanglement for Domain Generalization. (arXiv:2111.13839v2 [cs.LG] UPDATED)
    (2 min) A fundamental challenge for machine learning models is generalizing to out-of-distribution (OOD) data, in part due to spurious correlations. To tackle this challenge, we first formalize the OOD generalization problem as constrained optimization, called Disentanglement-constrained Domain Generalization (DDG). We relax this non-trivial constrained optimization to a tractable form with finite-dimensional parameterization and empirical approximation. Then a theoretical analysis of the extent to which the above transformations deviate from the original problem is provided. Based on the transformation, we propose a primal-dual algorithm for joint representation disentanglement and domain generalization. In contrast to traditional approaches based on domain adversarial training and domain labels, DDG jointly learns semantic and variation encoders for disentanglement, enabling flexible manipulation and augmentation on training data. DDG aims to learn intrinsic representations of semantic concepts that are invariant to nuisance factors and generalizable across different domains. Comprehensive experiments on popular benchmarks show that DDG can achieve competitive OOD performance and uncover interpretable salient structures within data.
    'Next Generation' Reservoir Computing: an Empirical Data-Driven Expression of Dynamical Equations in Time-Stepping Form. (arXiv:2201.05193v1 [cs.LG])
    (2 min) Next generation reservoir computing based on nonlinear vector autoregression (NVAR) is applied to emulate simple dynamical system models and compared to numerical integration schemes such as Euler and the $2^\text{nd}$ order Runge-Kutta. It is shown that the NVAR emulator can be interpreted as a data-driven method used to recover the numerical integration scheme that produced the data. It is also shown that the approach can be extended to produce high-order numerical schemes directly from data. The impacts of the presence of noise and temporal sparsity in the training set are further examined to gauge the potential use of this method for more realistic applications.
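    As background, an NVAR "reservoir" is just a fixed feature expansion (delayed states plus their products) with a linear readout fit by ridge regression; a minimal sketch of that construction, under the simplifying assumption of a quadratic feature set and one-step increments:
        import numpy as np

        def nvar_features(x, delays=2):
            # Feature vector at time t: constant 1, the state and its delayed
            # copies (linear part), plus all pairwise products (quadratic part).
            rows = []
            for t in range(delays, len(x)):
                lin = np.concatenate([np.atleast_1d(x[t - d]) for d in range(delays + 1)])
                quad = np.outer(lin, lin)[np.triu_indices(len(lin))]
                rows.append(np.concatenate([[1.0], lin, quad]))
            return np.array(rows)

        def fit_nvar(x, delays=2, ridge=1e-6):
            # Ridge-regress the one-step increment: learn W in
            # x_{t+1} = x_t + phi_t @ W, an Euler-like time-stepping map.
            phi = nvar_features(x, delays)[:-1]   # features at t = delays .. T-2
            dx = x[delays + 1:] - x[delays:-1]    # increments x_{t+1} - x_t
            A = phi.T @ phi + ridge * np.eye(phi.shape[1])
            return np.linalg.solve(A, phi.T @ dx)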
    Jamming Attacks on Federated Learning in Wireless Networks. (arXiv:2201.05172v1 [cs.LG])
    (2 min) Federated learning (FL) offers a decentralized learning environment so that a group of clients can collaborate to train a global model at the server, while keeping their training data confidential. This paper studies how to launch over-the-air jamming attacks to disrupt the FL process when it is executed over a wireless network. As a wireless example, FL is applied to learn how to classify wireless signals collected by clients (spectrum sensors) at different locations (such as in cooperative sensing). An adversary can jam the transmissions of the local model updates from clients to the server (uplink attack), or the transmissions of the global model updates from the server to clients (downlink attack), or both. Given a budget imposed on the number of clients that can be attacked per FL round, clients for the (uplink/downlink) attack are selected according to their local model accuracies that would be expected without an attack, or ranked via spectrum observations. This novel attack is extended to general settings by accounting for different processing speeds and attack success probabilities for clients. Compared to benchmark attack schemes, this attack approach degrades the FL performance significantly, thereby revealing new vulnerabilities of FL to jamming attacks in wireless networks.
    Synthesising Electronic Health Records: Cystic Fibrosis Patient Group. (arXiv:2201.05400v1 [cs.LG])
    (2 min) Class imbalance can often degrade the predictive performance of supervised learning algorithms. Balanced classes can be obtained by oversampling exact copies, with noise, or by interpolation between nearest neighbours (as in traditional SMOTE methods). Oversampling tabular data using augmentation, as is typical in computer vision tasks, can be achieved with deep generative models. Deep generative models are effective data synthesisers due to their ability to capture complex underlying distributions. Synthetic data in healthcare can enhance interoperability between healthcare providers while ensuring patient privacy. Equipped with large synthetic datasets that represent small patient groups well, machine learning in healthcare can address the current challenges of bias and generalisability. This paper evaluates synthetic data generators' ability to synthesise patient electronic health records. We test the utility of synthetic data for patient outcome classification, observing increased predictive performance when augmenting imbalanced datasets with synthetic data.
    When less is more: Simplifying inputs aids neural network understanding. (arXiv:2201.05610v1 [cs.LG])
    (2 min) How do neural network image classifiers respond to simpler and simpler inputs? And what do such responses reveal about the learning process? To answer these questions, we need a clear measure of input simplicity (or inversely, complexity), an optimization objective that correlates with simplification, and a framework to incorporate such objective into training and inference. Lastly we need a variety of testbeds to experiment and evaluate the impact of such simplification on learning. In this work, we measure simplicity with the encoding bit size given by a pretrained generative model, and minimize the bit size to simplify inputs in training and inference. We investigate the effect of such simplification in several scenarios: conventional training, dataset condensation and post-hoc explanations. In all settings, inputs are simplified along with the original classification task, and we investigate the trade-off between input simplicity and task performance. For images with injected distractors, such simplification naturally removes superfluous information. For dataset condensation, we find that inputs can be simplified with almost no accuracy degradation. When used in post-hoc explanation, our learning-based simplification approach offers a valuable new tool to explore the basis of network decisions.
    Probabilistic spatial clustering based on the Self Discipline Learning (SDL) model of autonomous learning. (arXiv:2201.03449v2 [cs.LG] UPDATED)
    (2 min) Unsupervised clustering algorithms can effectively reduce the dimension of high-dimensional unlabeled data, thus reducing the time and space complexity of data processing. However, traditional clustering algorithms need the upper bound on the number of categories to be set in advance, and deep learning clustering algorithms can fall into local optima. To solve these problems, a probabilistic spatial clustering algorithm based on the Self Discipline Learning (SDL) model is proposed. The algorithm is based on the Gaussian probability distribution of the probability-space distance between vectors, uses the probability scale and maximum probability value of the probability-space distance as the distance measure, and then determines the category of each sample according to the distribution characteristics of the dataset itself. The algorithm is tested on the Laboratory for Intelligent and Safe Automobiles (LISA) traffic light dataset, achieving an accuracy of 99.03% and a recall of 91%.
    AuGPT: Auxiliary Tasks and Data Augmentation for End-To-End Dialogue with Pre-Trained Language Models. (arXiv:2102.05126v3 [cs.CL] UPDATED)
    (2 min) Attention-based pre-trained language models such as GPT-2 brought considerable progress to end-to-end dialogue modelling. However, they also present considerable risks for task-oriented dialogue, such as lack of knowledge grounding or diversity. To address these issues, we introduce modified training objectives for language model finetuning, and we employ massive data augmentation via back-translation to increase the diversity of the training data. We further examine the possibilities of combining data from multiple sources to improve performance on the target dataset. We carefully evaluate our contributions with both human and automatic methods. Our model substantially outperforms the baseline on the MultiWOZ data and shows competitive performance with the state of the art in both automatic and human evaluation.
    Communication-Efficient ADMM-based Federated Learning. (arXiv:2110.15318v2 [cs.LG] UPDATED)
    (2 min) Federated learning has shown its advances over the last few years but is facing many challenges, such as how algorithms save communication resources, how they reduce computational costs, and whether they converge. To address these issues, this paper proposes exact and inexact ADMM-based federated learning. These methods are not only communication-efficient but also converge linearly under very mild conditions, such as not requiring convexity and being independent of the data distribution. Moreover, the inexact version has low computational complexity, thereby alleviating the computational burden significantly.
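    To make the ADMM structure concrete, here is a minimal consensus-ADMM sketch for clients that each hold a local least-squares problem (illustrative only; the paper's exact and inexact variants differ in how the local subproblem is solved):
        import numpy as np

        def admm_federated_least_squares(As, bs, rho=1.0, iters=50):
            # Minimize sum_i ||A_i x - b_i||^2 by consensus ADMM: each client
            # solves a small regularized local problem; only (x_i + u_i) is
            # shared and averaged into the global iterate z.
            n = As[0].shape[1]
            z = np.zeros(n)
            xs = [np.zeros(n) for _ in As]
            us = [np.zeros(n) for _ in As]
            for _ in range(iters):
                for i, (A, b) in enumerate(zip(As, bs)):
                    lhs = 2 * A.T @ A + rho * np.eye(n)       # local subproblem
                    rhs = 2 * A.T @ b + rho * (z - us[i])
                    xs[i] = np.linalg.solve(lhs, rhs)
                z = np.mean([x + u for x, u in zip(xs, us)], axis=0)  # consensus
                us = [u + x - z for x, u in zip(xs, us)]              # dual update
            return z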
    Segmentation of cell-level anomalies in electroluminescence images of photovoltaic modules. (arXiv:2106.10962v2 [cs.CV] UPDATED)
    (2 min) In the operation & maintenance (O&M) of photovoltaic (PV) plants, the early identification of failures has become crucial to maintain productivity and prolong components' life. Of all defects, cell-level anomalies can lead to serious failures and may affect surrounding PV modules in the long run. These fine defects are usually captured with high-spatial-resolution electroluminescence (EL) imaging. The difficulty of acquiring such images has limited the availability of data. For this work, multiple data resources and augmentation techniques have been used to surpass this limitation. Current state-of-the-art detection methods extract barely low-level information from individual PV cell images, and their performance is conditioned by the available training data. In this article, we propose an end-to-end deep learning pipeline that detects, locates and segments cell-level anomalies from entire photovoltaic modules via EL images. The proposed modular pipeline combines three deep learning techniques: 1. object detection (modified Faster R-CNN), 2. image classification (EfficientNet) and 3. weakly supervised segmentation (autoencoder). The modular nature of the pipeline allows upgrading the deep learning models as the state of the art improves, and extending the pipeline towards new functionalities.
    SignalNet: A Low Resolution Sinusoid Decomposition and Estimation Network. (arXiv:2106.05490v2 [eess.SP] UPDATED)
    (2 min) The detection and estimation of sinusoids is a fundamental signal processing task for many applications related to sensing and communications. While algorithms have been proposed for this setting, quantization is a critical, but often ignored modeling effect. In wireless communications, estimation with low resolution data converters is relevant for reduced power consumption in wideband receivers. Similarly, low resolution sampling in imaging and spectrum sensing allows for efficient data collection. In this work, we propose SignalNet, a neural network architecture that detects the number of sinusoids and estimates their parameters from quantized in-phase and quadrature samples. We incorporate signal reconstruction internally as domain knowledge within the network to enhance learning and surpass traditional algorithms in mean squared error and Chamfer error. We introduce a worst-case learning threshold for comparing the results of our network relative to the underlying data distributions. This threshold provides insight into why neural networks tend to outperform traditional methods and into the learned relationships between the input and output distributions. In simulation, we find that our algorithm is always able to surpass the threshold for three-bit data but often cannot exceed the threshold for one-bit data. We use the learning threshold to explain, in the one-bit case, how our estimators learn to minimize the distributional loss, rather than learn features from the data.
    Neural Circuit Architectural Priors for Embodied Control. (arXiv:2201.05242v1 [cs.LG])
    (2 min) Artificial neural networks for simulated motor control and robotics often adopt generic architectures like fully connected MLPs. While general, these tabula rasa architectures rely on large amounts of experience to learn, are not easily transferable to new bodies, and have internal dynamics that are difficult to interpret. In nature, animals are born with highly structured connectivity in their nervous systems, shaped by evolution; this innate circuitry acts synergistically with learning mechanisms to provide inductive biases that enable most animals to function well soon after birth and improve their abilities efficiently. Convolutional networks inspired by visual circuitry have encoded useful biases for vision. However, the extent to which ANN architectures inspired by neural circuitry can yield useful biases for other domains is unknown. In this work, we ask what advantages biologically inspired network architecture can provide in the context of motor control. Specifically, we translate C. elegans circuits for locomotion into an ANN model controlling a simulated Swimmer agent. On a locomotion task, our architecture achieves good initial performance and asymptotic performance comparable with MLPs, while dramatically improving data efficiency and requiring orders of magnitude fewer parameters. Our architecture is more interpretable and transfers to new body designs. An ablation analysis shows that principled excitation/inhibition is crucial for learning, while weight initialization contributes to good initial performance. Our work demonstrates several advantages of ANN architectures inspired by systems neuroscience and suggests a path towards modeling more complex behavior.
    A novel notion of barycenter for probability distributions based on optimal weak mass transport. (arXiv:2102.13380v3 [stat.ML] UPDATED)
    (2 min) We introduce weak barycenters of a family of probability distributions, based on the recently developed notion of optimal weak transport of mass by Gozlan et al. (2017) and Backhoff-Veraguas et al. (2020). We provide a theoretical analysis of this object and discuss its interpretation in light of convex ordering between probability measures. In particular, we show that, rather than averaging the input distributions in a geometric way (as the Wasserstein barycenter based on classic optimal transport does), weak barycenters extract common geometric information shared by all the input distributions, encoded as a latent random variable that underlies all of them. We also provide an iterative algorithm to compute a weak barycenter for a finite family of input distributions, and a stochastic algorithm that computes them for arbitrary populations of laws. The latter approach is particularly well suited to the streaming setting, i.e., when distributions are observed sequentially. The notion of weak barycenter and our approaches to compute it are illustrated on synthetic examples, validated on 2D real-world data and compared to standard Wasserstein barycenters.
    Reproducible, incremental representation learning with Rosetta VAE. (arXiv:2201.05206v1 [cs.LG])
    (2 min) Variational autoencoders are among the most popular methods for distilling low-dimensional structure from high-dimensional data, making them increasingly valuable as tools for data exploration and scientific discovery. However, unlike typical machine learning problems in which a single model is trained once on a single large dataset, scientific workflows privilege learned features that are reproducible, portable across labs, and capable of incrementally adding new data. Ideally, methods used by different research groups should produce comparable results, even without sharing fully trained models or entire data sets. Here, we address this challenge by introducing the Rosetta VAE (R-VAE), a method of distilling previously learned representations and retraining new models to reproduce and build on prior results. The R-VAE uses post hoc clustering over the latent space of a fully-trained model to identify a small number of Rosetta Points (input, latent pairs) to serve as anchors for training future models. An adjustable hyperparameter, $\rho$, balances fidelity to the previously learned latent space against accommodation of new data. We demonstrate that the R-VAE reconstructs data as well as the VAE and $\beta$-VAE, outperforms both methods in recovery of a target latent space in a sequential training setting, and dramatically increases consistency of the learned representation across training runs.
    Precise Stock Price Prediction for Robust Portfolio Design from Selected Sectors of the Indian Stock Market. (arXiv:2201.05570v1 [q-fin.PM])
    (0 min) Stock price prediction is a challenging task, and many propositions exist in the literature in this area. Portfolio construction is the process of choosing a group of stocks and investing in them optimally to maximize the return while minimizing the risk. Since Markowitz proposed Modern Portfolio Theory, several advancements have happened in the area of building efficient portfolios. An investor can get the best benefit out of the stock market by investing in an efficient portfolio and taking buy or sell decisions in advance, by estimating the future asset value of the portfolio with a high level of precision. In this project, we built an efficient portfolio and predicted its future asset value by means of individual stock price prediction of the stocks in the portfolio. As part of building the efficient portfolio, we studied multiple portfolio optimization methods, beginning with Modern Portfolio Theory. We built the minimum-variance portfolio and the optimal-risk portfolio for all five chosen sectors, using daily stock prices over the past five years as training data, and conducted backtesting to check the performance of the portfolios. A comparative backtesting study of the minimum-variance and optimal-risk portfolios against an equal-weight portfolio is also presented.
    IDEA: Interpretable Dynamic Ensemble Architecture for Time Series Prediction. (arXiv:2201.05336v1 [cs.LG])
    (0 min) We enhance the accuracy and generalization of univariate time-series point prediction by an explainable ensemble formed on the fly. We propose an Interpretable Dynamic Ensemble Architecture (IDEA), in which interpretable base learners give predictions independently, with sparse communication as a group. The model is composed of several sequentially stacked groups connected by group backcast residuals and recurrent input competition. The ensemble, driven by end-to-end training both horizontally and vertically, brings state-of-the-art (SOTA) performance: forecast accuracy improves by 2.6% over the best statistical benchmark on the TOURISM dataset and by 2% over the best deep learning benchmark on the M4 dataset. The architecture enjoys several advantages: it is applicable to time series from various domains, explainable to users through its specialized modular structure, and robust to changes in task distribution.
    Convolutional Neural Network Compression through Generalized Kronecker Product Decomposition. (arXiv:2109.14710v2 [cs.CV] UPDATED)
    (0 min) Modern Convolutional Neural Network (CNN) architectures, despite their superiority in solving various problems, are generally too large to be deployed on resource constrained edge devices. In this paper, we reduce memory usage and floating-point operations required by convolutional layers in CNNs. We compress these layers by generalizing the Kronecker Product Decomposition to apply to multidimensional tensors, leading to the Generalized Kronecker Product Decomposition (GKPD). Our approach yields a plug-and-play module that can be used as a drop-in replacement for any convolutional layer. Experimental results for image classification on CIFAR-10 and ImageNet datasets using ResNet, MobileNetv2 and SeNet architectures substantiate the effectiveness of our proposed approach. We find that GKPD outperforms state-of-the-art decomposition methods including Tensor-Train and Tensor-Ring as well as other relevant compression methods such as pruning and knowledge distillation.
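    For intuition, the classical building block behind this kind of compression is the nearest Kronecker product of Van Loan and Pitsianis, which GKPD generalizes to multidimensional convolution tensors; a small numpy sketch of the matrix case:
        import numpy as np

        def kron_factors(A, m1, n1, m2, n2, rank=1):
            # Best sum of `rank` Kronecker products B_k (m1 x n1) kron
            # C_k (m2 x n2) approximating A, via an SVD of a rearrangement of A.
            R = A.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
            U, s, Vt = np.linalg.svd(R, full_matrices=False)
            Bs = [np.sqrt(s[k]) * U[:, k].reshape(m1, n1) for k in range(rank)]
            Cs = [np.sqrt(s[k]) * Vt[k].reshape(m2, n2) for k in range(rank)]
            return Bs, Cs

        # Recover the factors of an exact Kronecker product:
        A = np.kron(np.random.randn(4, 3), np.random.randn(5, 6))
        (B,), (C,) = kron_factors(A, 4, 3, 5, 6)
        print(np.allclose(np.kron(B, C), A))  # True up to floating point
    Storing B and C instead of A replaces m1*m2*n1*n2 parameters with m1*n1 + m2*n2, which is where the memory savings come from.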
    Demystifying Swarm Learning: A New Paradigm of Blockchain-based Decentralized Federated Learning. (arXiv:2201.05286v1 [cs.LG])
    (2 min) Federated learning (FL) is an emerging and promising privacy-preserving machine learning paradigm that has drawn more and more attention from researchers and developers. FL keeps users' private data on devices and exchanges the gradients of local models to cooperatively train a shared Deep Learning (DL) model at a central custodian. However, the security and fault tolerance of FL have been increasingly discussed, because its central custodian mechanism or star-shaped architecture can be vulnerable to malicious attacks or software failures. To address these problems, Swarm Learning (SL) introduces a permissioned blockchain to securely onboard members and dynamically elect a leader, which allows performing DL in an extremely decentralized manner. Despite the tremendous attention SL has received, there are few empirical studies of SL or blockchain-based decentralized FL that provide comprehensive knowledge of best practices and precautions for deploying SL in real-world scenarios. Therefore, we conduct, to the best of our knowledge, the first comprehensive study of SL to date, to fill the knowledge gap between SL deployment and developers. In this paper, we conduct various experiments on 3 public datasets covering 5 research questions, present interesting findings, quantitatively analyze the reasons behind these findings, and provide developers and researchers with practical suggestions. The findings indicate that SL is suitable for most application scenarios, no matter whether the dataset is balanced, polluted, or biased over irrelevant features.
    Collaborative learning of images and geometrics for predicting isocitrate dehydrogenase status of glioma. (arXiv:2201.05530v1 [cs.LG])
    (0 min) The isocitrate dehydrogenase (IDH) gene mutation status is an important biomarker for glioma patients. The gold standard of IDH mutation detection requires tumour tissue obtained via invasive approaches and is usually expensive. Recent advancement in radiogenomics provides a non-invasive approach for predicting IDH mutation based on MRI. Meanwhile, tumour geometrics encompass crucial information for tumour phenotyping. Here we propose a collaborative learning framework that learns both tumour images and tumour geometrics using convolutional neural networks (CNN) and graph neural networks (GNN), respectively. Our results show that the proposed model outperforms the 3D-DenseNet121 baseline. Further, the collaborative learning model achieves better performance than either the CNN or the GNN alone. The model interpretation shows that the CNN and GNN could identify common and unique regions of interest for IDH mutation prediction. In conclusion, the collaboration of image and geometric learners provides a novel approach for predicting genotype and characterising glioma.
    The Implicit Regularization of Momentum Gradient Descent with Early Stopping. (arXiv:2201.05405v1 [cs.LG])
    (0 min) The study of the implicit regularization induced by gradient-based optimization is a longstanding pursuit. In the present paper, we characterize the implicit regularization of momentum gradient descent (MGD) with early stopping by comparing it with explicit $\ell_2$-regularization (ridge). In detail, we study MGD in the continuous-time view, the so-called momentum gradient flow (MGF), and show that its behavior is closer to ridge than that of gradient descent (GD) [Ali et al., 2019] for least squares regression. Moreover, we prove that, under the calibration $t=\sqrt{2/\lambda}$, where $t$ is the time parameter in MGF and $\lambda$ is the tuning parameter in ridge regression, the risk of MGF is no more than 1.54 times that of ridge. In particular, the relative Bayes risk of MGF to ridge is between 1 and 1.035 under the optimal tuning. Numerical experiments strongly support our theoretical results.
    Communication-Efficient TeraByte-Scale Model Training Framework for Online Advertising. (arXiv:2201.05500v1 [cs.IR])
    (0 min) Click-Through Rate (CTR) prediction is a crucial component in the online advertising industry. In order to produce a personalized CTR prediction, an industry-level CTR prediction model commonly takes a high-dimensional (e.g., 100 billion to 1 trillion features) sparse vector (encoded from query keywords, user portraits, etc.) as input. As a result, the model requires Terabyte-scale parameters to embed the high-dimensional input. Hierarchical distributed GPU parameter servers have been proposed to enable GPUs with limited memory to train the massive network by leveraging CPU main memory and SSDs as secondary storage. We identify two major challenges in the existing GPU training framework for massive-scale ad models and propose a collection of optimizations to tackle them: (a) the GPU, CPU, and SSD rapidly communicate with each other during training, and since the connections between GPUs and CPUs are non-uniform due to the hardware topology, the data communication route should be optimized accordingly; (b) GPUs in different computing nodes frequently communicate to synchronize parameters, and this communication must be optimized for the distributed system to scale. In this paper, we propose a hardware-aware training workflow that couples the hardware topology into the algorithm design. To reduce the extensive communication between computing nodes, we introduce a $k$-step model merging algorithm for the popular Adam optimizer and provide its convergence rate in non-convex optimization. To the best of our knowledge, this is the first application of a $k$-step adaptive optimization method in industrial-level CTR model training. The numerical results on real-world data confirm that the optimized system design considerably reduces the training time of the massive model, with essentially no loss in accuracy.
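    A rough sketch of k-step local training with periodic parameter merging, assuming plain averaging of worker parameters (all names and data are placeholders, and Adam state handling is simplified relative to the paper's algorithm):

```python
import copy
import torch

def local_steps_then_merge(global_model, shards, k=4, rounds=3, lr=1e-3):
    """Each worker runs k local Adam steps; parameters are then averaged."""
    for _ in range(rounds):
        states = []
        for shard in shards:                      # one data shard per worker
            model = copy.deepcopy(global_model)
            opt = torch.optim.Adam(model.parameters(), lr=lr)
            for x, y in shard[:k]:                # k local Adam steps
                opt.zero_grad()
                torch.nn.functional.mse_loss(model(x), y).backward()
                opt.step()
            states.append(model.state_dict())
        # Merge: average every parameter tensor across workers.
        merged = {name: torch.stack([s[name] for s in states]).mean(0)
                  for name in states[0]}
        global_model.load_state_dict(merged)
    return global_model

model = torch.nn.Linear(10, 1)
shards = [[(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(4)]
          for _ in range(8)]                       # 8 workers, 4 batches each
model = local_steps_then_merge(model, shards)
```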
    Classification of Brain Tumours in MR Images using Deep Spatiospatial Models. (arXiv:2105.14071v2 [eess.IV] UPDATED)
    (0 min) A brain tumour is a mass or cluster of abnormal cells in the brain, which has the possibility of becoming life-threatening because of its ability to invade neighbouring tissues and also form metastases. An accurate diagnosis is essential for successful treatment planning, and magnetic resonance imaging is the principal imaging modality for the diagnosis of brain tumours and their extent. Deep Learning methods in computer vision applications have shown significant improvement in recent years, most of which can be credited to the fact that a sizeable amount of data is available to train models on, and to improvements in model architectures yielding better approximations in a supervised setting. Classifying tumours using such deep learning methods has made significant progress with the availability of open datasets with reliable annotations. Typically those methods are either 3D models, which use 3D volumetric MRIs, or 2D models considering each slice separately. However, by treating the slice spatial dimension separately, spatiotemporal models can be employed as spatiospatial models for this task. These models have the capability of learning specific spatial and temporal relationships, while reducing computational costs. This paper uses two spatiotemporal models, ResNet (2+1)D and ResNet Mixed Convolution, to classify different types of brain tumours. It was observed that both these models performed better than the pure 3D convolutional model, ResNet18. Furthermore, it was also observed that pre-training the models on a different, even unrelated dataset before training them for the task of tumour classification improves the performance. Finally, the pre-trained ResNet Mixed Convolution was observed to be the best model in these experiments, achieving a macro F1-score of 0.93 and a test accuracy of 96.98\%, while at the same time being the model with the least computational cost.
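    The ResNet (2+1)D backbone is available in torchvision's video models; a minimal sketch of repurposing it so that the MRI slice axis plays the role of time (the class count and input size are illustrative, not the paper's configuration):

```python
import torch
from torchvision.models.video import r2plus1d_18

# Kinetics-pretrained (2+1)D ResNet; the slice axis takes the "time" slot.
model = r2plus1d_18(pretrained=True)
model.fc = torch.nn.Linear(model.fc.in_features, 3)  # e.g. 3 tumour classes

volume = torch.randn(2, 3, 16, 112, 112)  # (batch, channels, slices, H, W)
logits = model(volume)                     # (2, 3)
```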
    Consistent Approximations in Composite Optimization. (arXiv:2201.05250v1 [math.OC])
    (0 min) Approximations of optimization problems arise in computational procedures and sensitivity analysis. The resulting effect on solutions can be significant, with even small approximations of components of a problem translating into large errors in the solutions. We specify conditions under which approximations are well behaved in the sense of minimizers, stationary points, and level-sets and this leads to a framework of consistent approximations. The framework is developed for a broad class of composite problems, which are neither convex nor smooth. We demonstrate the framework using examples from stochastic optimization, neural-network based machine learning, distributionally robust optimization, penalty and augmented Lagrangian methods, interior-point methods, homotopy methods, smoothing methods, extended nonlinear programming, difference-of-convex programming, and multi-objective optimization. An enhanced proximal method illustrates the algorithmic possibilities. A quantitative analysis supplements the development by furnishing rates of convergence.
    Decentralized Robot Learning for Personalization and Privacy. (arXiv:2201.05527v1 [cs.LG])
    (2 min) From learning assistance to companionship, social robots promise to enhance many aspects of daily life. However, social robots have not seen widespread adoption, in part because (1) they do not adapt their behavior to new users, and (2) they do not provide sufficient privacy protections. Centralized learning, whereby robots develop skills by gathering data on a server, contributes to these limitations by preventing online learning of new experiences and requiring storage of privacy-sensitive data. In this work, we propose a decentralized learning alternative that improves the privacy and personalization of social robots. We combine two machine learning approaches, Federated Learning and Continual Learning, to capture interaction dynamics distributed physically across robots and temporally across repeated robot encounters. We define a set of criteria that should be balanced in decentralized robot learning scenarios. We also develop a new algorithm -- Elastic Transfer -- that leverages importance-based regularization to preserve relevant parameters across robots and interactions with multiple humans. We show that decentralized learning is a viable alternative to centralized learning in a proof-of-concept Socially-Aware Navigation domain, and demonstrate how Elastic Transfer improves several of the proposed criteria.
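    Importance-based regularization of this kind can be sketched in the style of Elastic Weight Consolidation: penalize drift of important parameters away from a reference snapshot. A generic illustration under that assumption, not the Elastic Transfer algorithm itself:

```python
import torch

def importance_penalty(model, ref_params, importance, lam=1.0):
    """Quadratic penalty on drift of important parameters from a snapshot."""
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (importance[name] * (p - ref_params[name]) ** 2).sum()
    return lam * penalty

model = torch.nn.Linear(4, 2)
ref = {n: p.detach().clone() for n, p in model.named_parameters()}
# Importance weights would normally come from e.g. Fisher information;
# ones are a placeholder here.
imp = {n: torch.ones_like(p) for n, p in model.named_parameters()}
reg_loss = importance_penalty(model, ref, imp)  # add to the task loss
```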
    Data Fusion with Latent Map Gaussian Processes. (arXiv:2112.02206v2 [stat.ML] UPDATED)
    (0 min) Multi-fidelity modeling and calibration are data fusion tasks that ubiquitously arise in engineering design. In this paper, we introduce a novel approach based on latent-map Gaussian processes (LMGPs) that enables efficient and accurate data fusion. In our approach, we convert data fusion into a latent space learning problem where the relations among different data sources are automatically learned. This conversion endows our approach with attractive advantages such as increased accuracy, reduced costs, flexibility to jointly fuse any number of data sources, and ability to visualize correlations between data sources. This visualization allows the user to detect model form errors or determine the optimum strategy for high-fidelity emulation by fitting LMGP only to the subset of the data sources that are well-correlated. We also develop a new kernel function that enables LMGPs to not only build a probabilistic multi-fidelity surrogate but also estimate calibration parameters with high accuracy and consistency. The implementation and use of our approach are considerably simpler and less prone to numerical issues compared to existing technologies. We demonstrate the benefits of LMGP-based data fusion by comparing its performance against competing methods on a wide range of examples.
    Use of High Dimensional Modeling for automatic variables selection: the best path algorithm. (arXiv:2105.03173v2 [stat.ML] UPDATED)
    (0 min) This paper presents a new algorithm for automatic variable selection. In particular, using properties of Graphical Models, it is possible to develop a method that can be used in the context of large datasets. The advantage of this algorithm is that it can be combined with different forecasting models. In this research we have used the OLS method and compared the results with the LASSO method.
    E(n) Equivariant Normalizing Flows. (arXiv:2105.09016v4 [cs.LG] UPDATED)
    (0 min) This paper introduces a generative model equivariant to Euclidean symmetries: E(n) Equivariant Normalizing Flows (E-NFs). To construct E-NFs, we take the discriminative E(n) graph neural networks and integrate them as a differential equation to obtain an invertible equivariant function: a continuous-time normalizing flow. We demonstrate that E-NFs considerably outperform baselines and existing methods from the literature on particle systems such as DW4 and LJ13, and on molecules from QM9 in terms of log-likelihood. To the best of our knowledge, this is the first flow that jointly generates molecule features and positions in 3D.
    Probabilistic Mass Mapping with Neural Score Estimation. (arXiv:2201.05561v1 [astro-ph.CO])
    (0 min) Weak lensing mass-mapping is a useful tool to access the full distribution of dark matter on the sky, but because of intrinsic galaxy ellipticities and finite fields/missing data, the recovery of dark matter maps constitutes a challenging ill-posed inverse problem. We introduce a novel methodology allowing for efficient sampling of the high-dimensional Bayesian posterior of the weak lensing mass-mapping problem, relying on simulations to define a fully non-Gaussian prior. We aim to demonstrate the accuracy of the method on simulations, and then proceed to apply it to the mass reconstruction of the HST/ACS COSMOS field. The proposed methodology combines elements of Bayesian statistics, analytic theory, and a recent class of Deep Generative Models based on Neural Score Matching. This approach allows us to do the following: 1) Make full use of analytic cosmological theory to constrain the 2pt statistics of the solution. 2) Learn from cosmological simulations any differences between this analytic prior and full simulations. 3) Obtain samples from the full Bayesian posterior of the problem for robust Uncertainty Quantification. We demonstrate the method on the $\kappa$TNG simulations and find that the posterior mean significantly outperforms previous methods (Kaiser-Squires, Wiener filter, Sparsity priors) both on root-mean-square error and in terms of the Pearson correlation. We further illustrate the interpretability of the recovered posterior by establishing a close correlation between posterior convergence values and the SNR of clusters artificially introduced into a field. Finally, we apply the method to the reconstruction of the HST/ACS COSMOS field, yielding the highest quality convergence map of this field to date.
    Tactical Optimism and Pessimism for Deep Reinforcement Learning. (arXiv:2102.03765v4 [cs.LG] UPDATED)
    (0 min) In recent years, deep off-policy actor-critic algorithms have become a dominant approach to reinforcement learning for continuous control. One of the primary drivers of this improved performance is the use of pessimistic value updates to address function approximation errors, which previously led to disappointing performance. However, a direct consequence of pessimism is reduced exploration, running counter to theoretical support for the efficacy of optimism in the face of uncertainty. So which approach is best? In this work, we show that the most effective degree of optimism can vary both across tasks and over the course of learning. Inspired by this insight, we introduce a novel deep actor-critic framework, Tactical Optimistic and Pessimistic (TOP) estimation, which switches between optimistic and pessimistic value learning online. This is achieved by formulating the selection as a multi-arm bandit problem. We show in a series of continuous control tasks that TOP outperforms existing methods which rely on a fixed degree of optimism, setting a new state of the art in challenging pixel-based environments. Since our changes are simple to implement, we believe these insights can easily be incorporated into a multitude of off-policy algorithms.
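    The online switching can be sketched as an adversarial bandit over candidate optimism coefficients; the EXP3-style update and the reward signal below are illustrative stand-ins for the paper's exact formulation:

```python
import numpy as np

betas = [-0.5, 0.0, 0.5]   # candidate degrees of pessimism/optimism
w = np.zeros(len(betas))   # bandit weights, one per arm
eta = 0.1

def pick_arm(w):
    p = np.exp(w - w.max())
    p /= p.sum()
    return np.random.choice(len(w), p=p), p

arm, p = pick_arm(w)
# ... run one episode with critic target = ensemble mean + betas[arm] * std ...
improvement = 1.0                     # placeholder: change in episode return
w[arm] += eta * improvement / p[arm]  # importance-weighted EXP3-style update
```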
    Multimodal analysis of the predictability of hand-gesture properties. (arXiv:2108.05762v3 [cs.HC] UPDATED)
    (0 min) Embodied conversational agents benefit from being able to accompany their speech with gestures. Although many data-driven approaches to gesture generation have been proposed in recent years, it is still unclear whether such systems can consistently generate gestures that convey meaning. We investigate which gesture properties (phase, category, and semantics) can be predicted from speech text and/or audio using contemporary deep learning. In extensive experiments, we show that gesture properties related to gesture meaning (semantics and category) are predictable from text features (time-aligned FastText embeddings) alone, but not from prosodic audio features, while rhythm-related gesture properties (phase) on the other hand can be predicted from audio features better than from text. These results are encouraging as they indicate that it is possible to equip an embodied agent with content-wise meaningful co-speech gestures using a machine-learning model.
    StAnD: A Dataset of Linear Static Analysis Problems. (arXiv:2201.05356v1 [cs.LG])
    (0 min) Static analysis of structures is a fundamental step for determining their stability. Both linear and non-linear static analyses consist of the resolution of sparse linear systems obtained by the finite element method. The development of fast and optimized solvers for sparse linear systems appearing in structural engineering requires data to compare existing approaches, tune algorithms, or evaluate new ideas. We introduce the Static Analysis Dataset (StAnD), containing 303,000 static analysis problems obtained by applying realistic loads to simulated frame structures. Along with the dataset, we publish a detailed benchmark comparison of the running time of existing solvers, both on CPU and GPU. We release the code used to generate the dataset and benchmark existing solvers on GitHub. To the best of our knowledge, this is the largest dataset for static analysis problems and the first public dataset of sparse linear systems (containing both the matrix and a realistic constant term).
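    A problem of this shape, a sparse symmetric positive-definite stiffness matrix with a load vector, can be solved with an off-the-shelf direct sparse solver as a baseline; the randomly generated system below is a stand-in for an actual StAnD instance:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

n = 1000
A = sp.random(n, n, density=1e-2, format="csr")
K = (A @ A.T + sp.identity(n) * n * 1e-2).tocsc()  # sparse SPD "stiffness" matrix
f = np.random.randn(n)                             # load vector (constant term)

u = spsolve(K, f)                                  # nodal displacements
print(np.linalg.norm(K @ u - f))                   # residual near machine precision
```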
    Towards a Data Privacy-Predictive Performance Trade-off. (arXiv:2201.05226v1 [cs.LG])
    (2 min) Machine learning is increasingly used in the most diverse applications and domains, whether in healthcare, to predict pathologies, or in the financial sector, to detect fraud. One of the linchpins for efficiency and accuracy in machine learning is data utility. However, when data contains personal information, full access may be restricted due to laws and regulations aiming to protect individuals' privacy. Therefore, data owners must ensure that any shared data guarantees such privacy. Removal or transformation of private information (de-identification) is among the most common techniques. Intuitively, one can anticipate that reducing detail or distorting information would result in losses of model predictive performance. However, previous work concerning classification tasks using de-identified data generally demonstrates that predictive performance can be preserved in specific applications. In this paper, we aim to evaluate the existence of a trade-off between data privacy and predictive performance in classification tasks. We leverage a large set of privacy-preserving techniques and learning algorithms to provide an assessment of re-identification ability and of the impact of transformed variants on predictive performance. Unlike previous literature, we confirm that the higher the level of privacy (the lower the re-identification risk), the higher the impact on predictive performance, pointing towards clear evidence of a trade-off.
    Robust Explainability: A Tutorial on Gradient-Based Attribution Methods for Deep Neural Networks. (arXiv:2107.11400v4 [cs.LG] UPDATED)
    (0 min) With the rise of deep neural networks, the challenge of explaining the predictions of these networks has become increasingly recognized. While many methods for explaining the decisions of deep neural networks exist, there is currently no consensus on how to evaluate them. On the other hand, robustness is a popular topic in deep learning research; however, it was hardly discussed in explainability until very recently. In this tutorial paper, we start by presenting gradient-based interpretability methods. These techniques use gradient signals to assign the burden of the decision to the input features. Later, we discuss how gradient-based methods can be evaluated for their robustness and the role that adversarial robustness plays in having meaningful explanations. We also discuss the limitations of gradient-based methods. Finally, we present the best practices and attributes that should be examined before choosing an explainability method. We conclude with future directions for research in the area at the convergence of robustness and explainability.
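    The simplest members of this family, vanilla saliency and gradient x input, fit in a few lines of PyTorch; a minimal sketch (the model is a placeholder):

```python
import torch

def grad_x_input(model, x, target_class):
    """Return vanilla saliency and gradient-x-input attributions."""
    x = x.clone().requires_grad_(True)
    model(x)[0, target_class].backward()   # gradient of the class score
    saliency = x.grad.abs()                # vanilla saliency map
    attribution = x.grad * x.detach()      # gradient x input
    return saliency, attribution

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 10))
saliency, attribution = grad_x_input(model, torch.randn(1, 1, 28, 28), target_class=3)
```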
    Generalization Error Bounds for Iterative Recovery Algorithms Unfolded as Neural Networks. (arXiv:2112.04364v2 [cs.LG] UPDATED)
    (2 min) Motivated by the learned iterative soft thresholding algorithm (LISTA), we introduce a general class of neural networks suitable for sparse reconstruction from few linear measurements. By allowing a wide range of degrees of weight-sharing between the layers, we enable a unified analysis for very different neural network types, ranging from recurrent ones to networks more similar to standard feedforward neural networks. Based on training samples, via empirical risk minimization we aim at learning the optimal network parameters and thereby the optimal network that reconstructs signals from their low-dimensional linear measurements. We derive generalization bounds by analyzing the Rademacher complexity of hypothesis classes consisting of such deep networks, that also take into account the thresholding parameters. We obtain estimates of the sample complexity that essentially depend only linearly on the number of parameters and on the depth. We apply our main result to obtain specific generalization bounds for several practical examples, including different algorithms for (implicit) dictionary learning, and convolutional neural networks.
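    One unfolded (L)ISTA iteration, x <- soft-threshold(W1 y + W2 x, theta), illustrates the network class; sharing the same layer across iterations gives the recurrent end of the weight-sharing spectrum. A minimal sketch with invented sizes:

```python
import torch

def soft_threshold(x, theta):
    return torch.sign(x) * torch.clamp(x.abs() - theta, min=0.0)

class LISTALayer(torch.nn.Module):
    """One unfolded iteration with a learnable threshold."""
    def __init__(self, m, n):
        super().__init__()
        self.W1 = torch.nn.Linear(m, n, bias=False)   # acts on measurements y
        self.W2 = torch.nn.Linear(n, n, bias=False)   # acts on current estimate x
        self.theta = torch.nn.Parameter(torch.tensor(0.1))

    def forward(self, y, x):
        return soft_threshold(self.W1(y) + self.W2(x), self.theta)

m, n = 20, 50
layer = LISTALayer(m, n)
y = torch.randn(8, m)
x = torch.zeros(8, n)
for _ in range(10):        # full weight sharing: the recurrent variant
    x = layer(y, x)
```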
    A Method for Controlling Extrapolation when Visualizing and Optimizing the Prediction Profiles of Statistical and Machine Learning Models. (arXiv:2201.05236v1 [stat.ML])
    (0 min) We present a novel method for controlling extrapolation in the prediction profiler in the JMP software. The prediction profiler is a graphical tool for exploring high-dimensional prediction surfaces for statistical and machine learning models. The profiler contains interactive cross-sectional views, or profile traces, of the prediction surface of a model. Our method helps users avoid exploring predictions that should be considered extrapolation. It also performs optimization over a constrained factor region that avoids extrapolation, using a genetic algorithm. In simulations and real-world examples, we demonstrate how optimal factor settings found without constraints in the profiler frequently require extrapolation, and how extrapolation control helps avoid such solutions with invalid factor settings that may not be useful to the user.
    A New Deep Hybrid Boosted and Ensemble Learning-based Brain Tumor Analysis using MRI. (arXiv:2201.05373v1 [eess.IV])
    (0 min) Brain tumor analysis is important for timely diagnosis and effective treatment. Tumor analysis is challenging because of tumor morphology factors like size, location, texture, and heteromorphic appearance in medical images. In this regard, a novel two-phase deep learning-based framework is proposed to detect and categorize brain tumors in magnetic resonance images (MRIs). In the first phase, a novel deep boosted features and ensemble classifiers (DBF-EC) scheme is proposed to effectively detect tumor MRI images from those of healthy individuals. The deep boosted feature space is achieved through customized and well-performing deep convolutional neural networks (CNNs), and is consequently fed into an ensemble of machine learning (ML) classifiers. In the second phase, a new hybrid features fusion-based brain tumor classification approach is proposed, comprising dynamic-static features and an ML classifier to categorize different tumor types. The dynamic features are extracted from the proposed BRAIN-RENet CNN, which carefully learns the heteromorphic and inconsistent behavior of various tumors, while the static features are extracted using HOG. The effectiveness of the proposed two-phase brain tumor analysis framework is validated on two standard benchmark datasets, collected from Kaggle and Figshare, containing different types of tumors, including glioma, meningioma, pituitary, and normal images. Experimental results show that the proposed DBF-EC detection scheme outperforms competing methods, achieving accuracy (99.56%), precision (0.9991), recall (0.9899), F1-Score (0.9945), MCC (0.9892), and AUC-PR (0.9990). In the classification scheme, the joint employment of fused deep features from the proposed BRAIN-RENet and HOG features improves performance significantly in terms of recall (0.9913), precision (0.9906), F1-Score (0.9909), and accuracy (99.20%) on diverse datasets.
    Domain-shift adaptation via linear transformations. (arXiv:2201.05282v1 [cs.LG])
    (0 min) A predictor, $f_A : X \to Y$, learned with data from a source domain (A) might not be accurate on a target domain (B) when their distributions are different. Domain adaptation aims to reduce the negative effects of this distribution mismatch. Here, we analyze the case where $P_A(Y\ |\ X) \neq P_B(Y\ |\ X)$, $P_A(X) \neq P_B(X)$ but $P_A(Y) = P_B(Y)$, and where there are affine transformations of $X$ that make all distributions equivalent. We propose an approach to project the source and target domains into a lower-dimensional, common space by (1) projecting each domain onto the eigenvectors of its empirical covariance matrix, and then (2) finding an orthogonal matrix that minimizes the maximum mean discrepancy between the projections of both domains. For arbitrary affine transformations, there is an inherent unidentifiability problem when performing unsupervised domain adaptation that can be alleviated in the semi-supervised case. We show the effectiveness of our approach on simulated data and on binary digit classification tasks, obtaining improvements of up to 48% in accuracy when correcting for the domain shift in the data.
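    A hedged sketch of the two-step recipe, substituting orthogonal Procrustes on a few paired anchors for the paper's MMD minimization over orthogonal matrices (all data below are random placeholders):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def project(X, d):
    # Step (1): project a domain onto the top-d eigenvectors of its covariance.
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T

XA = np.random.randn(200, 20)   # source domain (placeholder)
XB = np.random.randn(150, 20)   # target domain (placeholder)
ZA, ZB = project(XA, 5), project(XB, 5)

k = 50                          # paired anchors (semi-supervised case)
R, _ = orthogonal_procrustes(ZB[:k], ZA[:k])
ZB_aligned = ZB @ R             # step (2), Procrustes in lieu of MMD minimization
```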
    Machine Learning for Multi-Output Regression: When should a holistic multivariate approach be preferred over separate univariate ones?. (arXiv:2201.05340v1 [stat.ML])
    (0 min) Tree-based ensembles such as the Random Forest are modern classics among statistical learning methods. In particular, they are used for predicting univariate responses. In the case of multiple outputs, the question arises whether to separately fit univariate models or directly follow a multivariate approach. For the latter, several possibilities exist that are based, e.g., on modified splitting or stopping rules for multi-output regression. In this work we compare these methods in extensive simulations to help answer the primary question of when to use multivariate ensemble techniques.
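    In scikit-learn the two competing strategies are one line apart, which makes such comparisons easy to set up; a minimal example on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

X, Y = make_regression(n_samples=500, n_features=10, n_targets=3, random_state=0)

# Holistic multivariate approach: one forest fit jointly on all targets.
multivariate = RandomForestRegressor(random_state=0).fit(X, Y)

# Separate univariate approach: one independent forest per target.
univariate = MultiOutputRegressor(RandomForestRegressor(random_state=0)).fit(X, Y)
```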
    On Preference Learning Based on Sequential Bayesian Optimization with Pairwise Comparison. (arXiv:2103.13192v2 [cs.LG] UPDATED)
    (0 min) User preference learning is generally a hard problem. Individual preferences are typically unknown even to users themselves, while the space of choices is infinite. Here we study user preference learning from an information-theoretic perspective. We model preference learning as a system with two interacting sub-systems, one representing a user with his/her preferences and another representing an agent that has to learn these preferences. The user and his/her behaviour are modeled by a parametric preference function. To efficiently learn the preferences and quickly reduce the search space, we propose an agent that interacts with the user to collect the most informative data for learning. The agent presents two proposals to the user for evaluation, and the user rates them based on his/her preference function. We show that the optimum agent strategy for data collection and preference learning results from maximin optimization of the normalized weighted Kullback-Leibler (KL) divergence between the true and agent-assigned predictive user response distributions. The resulting value of the KL-divergence, which we also call the remaining system uncertainty (RSU), provides an efficient performance metric in the absence of the ground truth. This metric characterises how well the agent can predict the user and, thus, the quality of the underlying learned user (preference) model. Our proposed agent comprises sequential mechanisms for user model inference and proposal generation. To infer the user model (preference function), Bayesian approximate inference is used in the agent. The data collection strategy is to generate proposals whose responses most help resolve the uncertainty associated with prediction of the user responses. The efficiency of our approach is validated by numerical simulations.
    Replay-Guided Adversarial Environment Design. (arXiv:2110.02439v2 [cs.LG] UPDATED)
    (0 min) Deep reinforcement learning (RL) agents may successfully generalize to new settings if trained on an appropriately diverse set of environment and task configurations. Unsupervised Environment Design (UED) is a promising self-supervised RL paradigm, wherein the free parameters of an underspecified environment are automatically adapted during training to the agent's capabilities, leading to the emergence of diverse training environments. Here, we cast Prioritized Level Replay (PLR), an empirically successful but theoretically unmotivated method that selectively samples randomly-generated training levels, as UED. We argue that by curating completely random levels, PLR, too, can generate novel and complex levels for effective training. This insight reveals a natural class of UED methods we call Dual Curriculum Design (DCD). Crucially, DCD includes both PLR and a popular UED algorithm, PAIRED, as special cases and inherits similar theoretical guarantees. This connection allows us to develop novel theory for PLR, providing a version with a robustness guarantee at Nash equilibria. Furthermore, our theory suggests a highly counterintuitive improvement to PLR: by stopping the agent from updating its policy on uncurated levels (training on less data), we can improve the convergence to Nash equilibria. Indeed, our experiments confirm that our new method, PLR$^{\perp}$, obtains better results on a suite of out-of-distribution, zero-shot transfer tasks, in addition to demonstrating that PLR$^{\perp}$ improves the performance of PAIRED, from which it inherited its theoretical framework.
    Learning Enhancement of CNNs via Separation Index Maximizing at the First Convolutional Layer. (arXiv:2201.05217v1 [cs.LG])
    (0 min) In this paper, a straightforward enhancement learning algorithm based on the Separation Index (SI) concept is proposed for Convolutional Neural Networks (CNNs). First, the SI, a supervised complexity measure, is explained, and its usage in improving the learning of CNNs for classification problems is illustrated. Then, a learning strategy is proposed through which the first layer of a CNN is optimized by maximizing the SI, while the further layers are trained through the backpropagation algorithm. In order to maximize the SI at the first layer, a variant of ranking loss is optimized using the quasi-least-squares error technique. Applying this learning strategy to several known CNNs and datasets demonstrates its enhancement impact in almost all cases.
    RecoMed: A Knowledge-Aware Recommender System for Hypertension Medications. (arXiv:2201.05461v1 [cs.IR])
    (0 min) Background and Objective: High medicine diversity has always been a significant challenge for prescription, causing confusion or doubt in physicians' decision-making process. This paper aims to develop a medicine recommender system called RecoMed to aid the physician in the prescription process for hypertension, by providing information about what medications have been prescribed by other doctors and figuring out what other medicines can be recommended in addition to the one in question. Methods: There are two steps to the developed method: First, association rule mining algorithms are employed to find medicine association rules. The second step entails graph mining and clustering to present an enriched recommendation via ATC code, which itself comprises several steps. First, the initial graph is constructed from historical prescription data. Then, data pruning is performed in the second step, after which the medicines with a high repetition rate are removed at the discretion of a general medical practitioner. Next, the medicines are matched to a well-known medicine classification system called the ATC code to provide an enriched recommendation. And finally, the DBSCAN and Louvain algorithms cluster medicines in the final step. Results: A list of recommended medicines is provided as the system's output, and physicians can choose one or more of the medicines based on the patient's clinical symptoms. Only the medicines of class 2, related to high blood pressure medications, are used to assess the system's performance. The results obtained from this system have been reviewed and confirmed by an expert in this field.
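    The first step, mining medicine association rules from one-hot prescription baskets, can be sketched with mlxtend's apriori implementation (the medicine names and thresholds below are invented for illustration):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot prescriptions: rows = prescriptions, columns = medicines (toy data).
baskets = pd.DataFrame([[1, 1, 0], [1, 0, 1], [1, 1, 1]],
                       columns=["amlodipine", "losartan", "hctz"])

itemsets = apriori(baskets.astype(bool), min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "confidence"]])
```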
    Compact Graph Structure Learning via Mutual Information Compression. (arXiv:2201.05540v1 [cs.LG])
    (0 min) Graph Structure Learning (GSL) has recently attracted considerable attention for its capacity to optimize the graph structure while simultaneously learning suitable parameters of Graph Neural Networks (GNNs). Current GSL methods mainly learn an optimal graph structure (final view) from single or multiple information sources (basic views); however, theoretical guidance on what constitutes an optimal graph structure is still unexplored. In essence, an optimal graph structure should only contain the information about tasks while compressing redundant noise as much as possible, which is defined as a "minimal sufficient structure", so as to maintain accuracy and robustness. How can such a structure be obtained in a principled way? In this paper, we theoretically prove that if we optimize basic views and the final view based on mutual information, while maintaining their performance on labels, the final view will be a minimal sufficient structure. With this guidance, we propose a Compact GSL architecture by MI compression, named CoGSL. Specifically, two basic views are extracted from the original graph as two inputs of the model, which are then re-estimated by a view estimator. We then propose an adaptive technique to fuse the estimated views into the final view. Furthermore, we maintain the performance of the estimated views and the final view while reducing the mutual information between every two views. To comprehensively evaluate the performance of CoGSL, we conduct extensive experiments on several datasets under clean and attacked conditions, which demonstrate the effectiveness and robustness of CoGSL.
    AWSnet: An Auto-weighted Supervision Attention Network for Myocardial Scar and Edema Segmentation in Multi-sequence Cardiac Magnetic Resonance Images. (arXiv:2201.05344v1 [eess.IV])
    (0 min) Multi-sequence cardiac magnetic resonance (CMR) provides essential pathology information (scar and edema) to diagnose myocardial infarction. However, automatic pathology segmentation can be challenging due to the difficulty of effectively exploring the underlying information from the multi-sequence CMR data. This paper aims to tackle the scar and edema segmentation from multi-sequence CMR with a novel auto-weighted supervision framework, where the interactions among different supervised layers are explored under a task-specific objective using reinforcement learning. Furthermore, we design a coarse-to-fine framework to boost the small myocardial pathology region segmentation with shape prior knowledge. The coarse segmentation model identifies the left ventricle myocardial structure as a shape prior, while the fine segmentation model integrates a pixel-wise attention strategy with an auto-weighted supervision model to learn and extract salient pathological structures from the multi-sequence CMR data. Extensive experimental results on a publicly available dataset from Myocardial pathology segmentation combining multi-sequence CMR (MyoPS 2020) demonstrate our method can achieve promising performance compared with other state-of-the-art methods. Our method is promising in advancing the myocardial pathology assessment on multi-sequence CMR data. To motivate the community, we have made our code publicly available via https://github.com/soleilssss/AWSnet/tree/master.
    Learning from Mistakes -- A Framework for Neural Architecture Search. (arXiv:2111.06353v2 [cs.LG] UPDATED)
    (0 min) Learning from one's mistakes is an effective human learning technique in which learners focus more on the topics where mistakes were made, so as to deepen their understanding. In this paper, we investigate whether this human learning strategy can be applied in machine learning. We propose a novel machine learning method called Learning From Mistakes (LFM), wherein the learner improves its ability to learn by focusing more on the mistakes during revision. We formulate LFM as a three-stage optimization problem: 1) the learner learns; 2) the learner re-learns, focusing on the mistakes; and 3) the learner validates its learning. We develop an efficient algorithm to solve the LFM problem. We apply the LFM framework to neural architecture search on CIFAR-10, CIFAR-100, and ImageNet. Experimental results strongly demonstrate the effectiveness of our model.
    Multi-modal Attention Network for Stock Movements Prediction. (arXiv:2112.13593v2 [cs.LG] UPDATED)
    (0 min) Stock prices move as piecewise trending fluctuations rather than as a purely random walk. Traditionally, the prediction of future stock movements is based on the historical trading record. Nowadays, with the development of social media, many active participants in the market choose to publicize their strategies, which provides a window to glimpse the whole market's attitude towards future movements by extracting the semantics behind social media posts. However, social media contains conflicting information and cannot replace historical records completely. In this work, we propose a multi-modality attention network to reduce conflicts and integrate semantic and numeric features to predict future stock movements comprehensively. Specifically, we first extract semantic information from social media and estimate its credibility based on posters' identity and public reputation. Then we incorporate the semantics from online posts and numeric features from historical records to form the trading strategy. Experimental results show that our approach outperforms previous methods by a significant margin in both prediction accuracy (61.20\%) and trading profits (9.13\%). It demonstrates that our method improves the performance of stock movement prediction and informs future research on multi-modality fusion for stock prediction.
    Assemble Foundation Models for Automatic Code Summarization. (arXiv:2201.05222v1 [cs.SE])
    (0 min) Automatic code summarization is beneficial to software development and maintenance since it reduces the burden of manual tasks. Currently, artificial intelligence is undergoing a paradigm shift. The foundation models pretrained on massive data and finetuned to downstream tasks surpass specially customized models. This trend inspired us to consider reusing foundation models instead of learning from scratch. Based on this, we propose a flexible and robust approach for automatic code summarization based on neural networks. We assemble available foundation models, such as CodeBERT and GPT-2, into a single model named AdaMo. Moreover, we utilize Gaussian noise as the simulation of contextual information to optimize the latent representation. Furthermore, we introduce two adaptive schemes from the perspective of knowledge transfer, namely continuous pretraining and intermediate finetuning, and design intermediate stage tasks for general sequence-to-sequence learning. Finally, we evaluate AdaMo against a benchmark dataset for code summarization, by comparing it with state-of-the-art models.
    Security Orchestration, Automation, and Response Engine for Deployment of Behavioural Honeypots. (arXiv:2201.05326v1 [cs.CR])
    (0 min) Cyber security is a critical topic for organizations with IT/OT networks as they are always susceptible to attack, whether insider or outsider. Since the cyber landscape is an ever-evolving scenario, one must keep upgrading its security systems to enhance the security of the infrastructure. Tools like Security Information and Event Management (SIEM), Endpoint Detection and Response (EDR), Threat Intelligence Platform (TIP), and Information Technology Service Management (ITSM), along with other defensive techniques like Intrusion Detection System (IDS) and Intrusion Protection System (IPS), enhance the cyber security posture of the infrastructure. However, these protection mechanisms have their limitations; they are insufficient to ensure security, and the attacker penetrates the network. Deception technology, along with honeypots, presents a false sense of vulnerability in the target systems to the attackers. Deceived attackers reveal threat intelligence about their modus operandi. We have developed a Security Orchestration, Automation, and Response (SOAR) Engine that dynamically deploys custom honeypots inside the internal network infrastructure based on the attacker's behavior. The architecture is robust enough to support multiple VLANs connected to the system and used for orchestration. The presence of botnet traffic and DDoS attacks on the honeypots in the network is detected, along with a malware collection system. After being exposed to live traffic for four days, our engine dynamically orchestrated the honeypots 40 times, detected 7823 attacks, 965 DDoS attack packets, and three malicious samples. While our experiments with static honeypots show an average attacker engagement time of 102 seconds per instance, our SOAR Engine-based dynamic honeypots engage attackers for an average of 3148 seconds.
    Impact of Stop Sets on Stopping Active Learning for Text Classification. (arXiv:2201.05460v1 [cs.IR])
    (0 min) Active learning is an increasingly important branch of machine learning and a powerful technique for natural language processing. The main advantage of active learning is its potential to reduce the amount of labeled data needed to learn high-performing models. A vital aspect of an effective active learning algorithm is the determination of when to stop obtaining additional labeled data. Several leading state-of-the-art stopping methods use a stop set to help make this decision. However, relatively little attention has been given to the choice of stop set compared to the stopping algorithms that are applied on it, even though different choices of stop sets can lead to significant differences in stopping method performance. We investigate the impact of different stop set choices on different stopping methods. This paper shows that the choice of the stop set can have a significant impact on the performance of stopping methods, and that the impact differs for stability-based and confidence-based methods. Furthermore, the unbiased representative stop sets suggested by the original authors of the methods work better than the systematically biased stop sets used in recently published work, and stopping methods based on stabilizing predictions perform better than confidence-based stopping methods when unbiased representative stop sets are used. We provide the largest quantity of experimental results on the impact of stop sets to date. The findings are important for illuminating the impact of this aspect of stopping methods, which has been under-considered in recently published work and which can have a large practical impact on the performance of stopping methods for important semantic computing applications such as technology-assisted review and text classification more broadly.
    Networks with pixels embedding: a method to improve noise resistance in images classification. (arXiv:2005.11679v3 [cs.CV] UPDATED)
    (0 min) In the task of image classification, the network is usually sensitive to noise. For example, an image of a cat with noise might be misclassified as an ostrich. Conventionally, to overcome the problem of noise, one uses the technique of data augmentation, that is, teaching the network to distinguish noise by adding more noisy images to the training dataset. In this work, we provide a noise-resistant network for image classification by introducing a technique of pixel embedding. We test the network with pixel embedding, abbreviated as the network with PE, on the MNIST database of handwritten digits. It shows that the network with PE outperforms the conventional network on noisy images. The technique of pixel embedding can be used in many image classification tasks to improve noise resistance.
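    The abstract does not spell out the mechanism, but one plausible reading of pixel embedding is a learned lookup table over quantized intensities, so that a perturbed pixel maps to a nearby embedding rather than shifting raw values; a speculative sketch under that assumption:

```python
import torch

class PixelEmbedding(torch.nn.Module):
    """Speculative reading: quantize intensities, look each value up in a
    learned table, and feed the embedding channels to the usual conv stack."""
    def __init__(self, levels=256, dim=8):
        super().__init__()
        self.embed = torch.nn.Embedding(levels, dim)

    def forward(self, img):                      # img in [0, 1], (B, 1, H, W)
        idx = (img * 255).long().clamp(0, 255)
        e = self.embed(idx.squeeze(1))           # (B, H, W, dim)
        return e.permute(0, 3, 1, 2)             # (B, dim, H, W) for conv layers

features = PixelEmbedding()(torch.rand(2, 1, 28, 28))  # (2, 8, 28, 28)
```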
    Radar Occupancy Prediction with Lidar Supervision while Preserving Long-Range Sensing and Penetrating Capabilities. (arXiv:2112.04282v2 [cs.RO] UPDATED)
    (0 min) Radar shows great potential for autonomous driving by accomplishing long-range sensing under diverse weather conditions. But radar is also a particularly challenging sensing modality due to radar noise. Recent works have made enormous progress in classifying free and occupied spaces in radar images by leveraging lidar label supervision. However, there are still several unsolved issues. Firstly, the sensing distance of the results is limited by the sensing range of lidar. Secondly, the performance of the results is degraded by lidar due to the physical sensing discrepancies between the two sensors. For example, some objects visible to lidar are invisible to radar, and some objects occluded in lidar scans are visible in radar images because of the radar's penetrating capability. These sensing differences cause false positives and degraded penetrating capability, respectively. In this paper, we propose training data preprocessing and polar sliding window inference to solve these issues. The data preprocessing aims to reduce the effect caused by radar-invisible measurements in lidar scans. The polar sliding window inference aims to solve the limited sensing range issue by applying a near-range trained network to the long-range region. Instead of using the common Cartesian representation, we propose to use a polar representation to reduce the shape dissimilarity between long-range and near-range data. We find that extending a near-range trained network to long-range region inference in polar space yields 4.2 times better IoU than in Cartesian space. Besides, the polar sliding window inference can preserve the radar penetrating capability by changing the viewpoint of the inference region, which makes some occluded measurements appear non-occluded to a pretrained network.
    Fast and accurate waveform modeling of long-haul multi-channel optical fiber transmission using a hybrid model-data driven scheme. (arXiv:2201.05502v1 [eess.SP])
    (0 min) The modeling of optical wave propagation in optical fiber amounts to solving the nonlinear Schr\"odinger equation (NLSE) quickly and accurately, and can enable research progress and system design in optical fiber communications, which are the infrastructure of modern communication systems. Traditional modeling of fiber channels using the split-step Fourier method (SSFM) has long been regarded as challenging in long-haul wavelength division multiplexing (WDM) optical fiber communication systems because it is extremely time-consuming. Here we propose a linear-nonlinear feature decoupling distributed (FDD) waveform modeling scheme to model the long-haul WDM fiber channel, where the channel linear effects are modelled by NLSE-derived model-driven methods and the nonlinear effects are modelled by data-driven deep learning methods. Meanwhile, the proposed scheme only focuses on fitting a single fiber span, and then recursively applies the model to achieve the required transmission distance. The proposed modeling scheme is demonstrated to have high accuracy, high computing speed, and robust generalization abilities for different optical launch powers, modulation formats, channel numbers and transmission distances. The total running time of the FDD waveform modeling scheme for 41-channel 1040-km fiber transmission is only 3 minutes, versus more than 2 hours using SSFM for each input condition, which achieves a 98% reduction in computing time. Considering the multi-round optimization involved in adjusting system parameters, the complexity reduction is significant. The results represent a remarkable improvement in nonlinear fiber modeling and open up novel perspectives for the solution of NLSE-like partial differential equations and optical fiber physics problems.
    Self-Supervised Learning with Data Augmentations Provably Isolates Content from Style. (arXiv:2106.04619v4 [stat.ML] UPDATED)
    (0 min) Self-supervised representation learning has shown remarkable success in a number of domains. A common practice is to perform data augmentation via hand-crafted transformations intended to leave the semantics of the data invariant. We seek to understand the empirical success of this approach from a theoretical perspective. We formulate the augmentation process as a latent variable model by postulating a partition of the latent representation into a content component, which is assumed invariant to augmentation, and a style component, which is allowed to change. Unlike prior work on disentanglement and independent component analysis, we allow for both nontrivial statistical and causal dependencies in the latent space. We study the identifiability of the latent representation based on pairs of views of the observations and prove sufficient conditions that allow us to identify the invariant content partition up to an invertible mapping in both generative and discriminative settings. We find numerical simulations with dependent latent variables are consistent with our theory. Lastly, we introduce Causal3DIdent, a dataset of high-dimensional, visually complex images with rich causal dependencies, which we use to study the effect of data augmentations performed in practice.
    Reusing Auto-Schedules for Efficient DNN Compilation. (arXiv:2201.05587v1 [cs.LG])
    (0 min) Auto-scheduling is a process where a search algorithm automatically explores candidate schedules (program transformations) for a given tensor program on a given hardware platform to improve its performance. However, this can be a very time-consuming process, depending on the complexity of the tensor program and the capacity of the target device, with often many thousands of program variants being explored. To address this, in this paper we introduce and demonstrate the idea of \emph{tuning-reuse}, a novel approach to identify and re-use auto-schedules between tensor programs. We demonstrate this concept using Deep Neural Networks (DNNs), taking sets of auto-schedules from pre-tuned DNNs and using them to reduce the inference time of a new DNN. Given a set of pre-tuned schedules, tuning-reuse provides its maximum speedup in less time than auto-scheduling using the state-of-the-art Ansor auto-scheduler. On a set of widely used DNN models, we apply tuning-reuse and achieve maximum speedups between $1.16\times$ and $4.76\times$, while outperforming Ansor when given limited tuning time.
    Study of Frequency domain exponential functional link network filters. (arXiv:2201.05501v1 [eess.SP])
    (0 min) The exponential functional link network (EFLN) filter has attracted tremendous interest due to its enhanced nonlinear modeling capability. However, the computational complexity will dramatically increase with the dimension growth of the EFLN-based filter. To improve the computational efficiency, we propose a novel frequency domain exponential functional link network (FDEFLN) filter in this paper. The idea is to organize the samples in blocks of expanded input data, transform them from time domain to frequency domain, and thus execute the filtering and adaptation procedures in frequency domain with the overlap-save method. A FDEFLN-based nonlinear active noise control (NANC) system has also been developed to form the frequency domain exponential filtered-s least mean-square (FDEFsLMS) algorithm. Moreover, the stability, steady-state performance and computational complexity of algorithms are analyzed. Finally, several numerical experiments corroborate the proposed FDEFLN-based algorithms in nonlinear system identification, acoustic echo cancellation and NANC implementations, which demonstrate much better computational efficiency.
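    The overlap-save machinery that the frequency-domain filter relies on can be sketched for a plain FIR filter (the functional-link expansion and the adaptive update are omitted):

```python
import numpy as np

def overlap_save(x, h, block=256):
    """Block convolution via FFT with the overlap-save method."""
    M = len(h)
    N = block + M - 1                       # FFT length per block
    H = np.fft.rfft(h, N)
    y = np.zeros(len(x))
    xp = np.concatenate([np.zeros(M - 1), x])  # prepend M-1 zeros of "history"
    for start in range(0, len(x), block):
        seg = xp[start:start + N]
        if len(seg) < N:
            seg = np.pad(seg, (0, N - len(seg)))
        # Circular convolution; the first M-1 outputs are aliased and dropped.
        yblock = np.fft.irfft(np.fft.rfft(seg) * H, N)[M - 1:]
        n_out = min(block, len(x) - start)
        y[start:start + n_out] = yblock[:n_out]
    return y

x, h = np.random.randn(1000), np.random.randn(32)
print(np.allclose(overlap_save(x, h), np.convolve(x, h)[:len(x)]))  # True
```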
    Waveform Learning for Reduced Out-of-Band Emissions Under a Nonlinear Power Amplifier. (arXiv:2201.05524v1 [eess.SP])
    (0 min) Machine learning (ML) has shown great promise in optimizing various aspects of the physical layer processing in wireless communication systems. In this paper, we use ML to learn jointly the transmit waveform and the frequency-domain receiver. In particular, we consider a scenario where the transmitter power amplifier is operating in a nonlinear manner, and ML is used to optimize the waveform to minimize the out-of-band emissions. The system also learns a constellation shape that facilitates pilotless detection by the simultaneously learned receiver. The simulation results show that such an end-to-end optimized system can communicate data more accurately and with less out-of-band emissions than conventional systems, thereby demonstrating the potential of ML in optimizing the air interface. To the best of our knowledge, there are no prior works considering the power amplifier induced emissions in an end-to-end learned system. These findings pave the way towards an ML-native air interface, which could be one of the building blocks of 6G.
    DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. (arXiv:2201.05596v1 [cs.LG])
    (0 min) As the training of giant dense models hits the boundary of the availability and capability of today's hardware resources, Mixture-of-Experts (MoE) models have become one of the most promising model architectures due to their significant training cost reduction compared to quality-equivalent dense models. Their training cost saving has been demonstrated from encoder-decoder models (prior works) to a 5x saving for auto-regressive language models (this work along with parallel explorations). However, due to the much larger model size and unique architecture, how to provide fast MoE model inference remains challenging and unsolved, limiting its practical usage. To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the DeepSpeed library, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x, and a highly optimized inference system that provides 7.3x better latency and cost compared to existing MoE inference solutions. DeepSpeed-MoE offers an unprecedented scale and efficiency to serve massive MoE models with up to 4.5x faster and 9x cheaper inference compared to quality-equivalent dense models. We hope our innovations and systems help open a promising path to new directions in the large model landscape, a shift from dense to sparse MoE models, where training and deploying higher-quality models with fewer resources becomes more widely possible.
    Generalized Approach to Matched Filtering using Neural Networks. (arXiv:2104.03961v2 [astro-ph.IM] UPDATED)
    (0 min) Gravitational wave science is a pioneering field with rapidly evolving data analysis methodology, currently assimilating and inventing deep learning techniques. The bulk of the sophisticated flagship searches of the field rely on the time-tested matched filtering principle at their core. In this paper, we make a key observation on the relationship between the emerging deep learning and the traditional techniques: matched filtering is formally equivalent to a particular neural network. This means that a neural network can be constructed analytically to exactly implement matched filtering, and can be further trained on data or boosted with additional complexity for improved performance. Moreover, we show that the proposed neural network architecture can outperform matched filtering, both with and without knowledge of a prior on the parameter distribution. When a prior is given, the proposed neural network can approach the statistically optimal performance. We also propose and investigate two different neural network architectures, MNet-Shallow and MNet-Deep, both of which implement matched filtering at initialization and can be trained on data. MNet-Shallow has a simpler structure, while MNet-Deep is more flexible and can deal with a wider range of distributions. Our theoretical findings are corroborated by experiments using real LIGO data and synthetic injections, where our proposed methods significantly outperform matched filtering at false positive rates above $5\times 10^{-3}\%$. The fundamental equivalence between matched filtering and neural networks allows us to define a "complexity standard candle" to characterize the relative complexity of the different approaches to gravitational wave signal searches in a common framework. Finally, our results suggest new perspectives on the role of deep learning in gravitational wave detection.
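    The formal equivalence is easy to see in code: a convolutional layer whose kernel is the template computes the matched-filter correlation series (PyTorch's Conv1d already performs cross-correlation, so no time reversal is needed). A sketch with a toy template:

```python
import numpy as np
import torch

# Toy template; in practice this would be a whitened waveform template.
template = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 64))

conv = torch.nn.Conv1d(1, 1, kernel_size=64, bias=False)
with torch.no_grad():
    # Conv1d computes cross-correlation, so the kernel is the template itself.
    conv.weight[0, 0] = torch.tensor(template, dtype=torch.float32)

x = torch.randn(1, 1, 1024)     # stand-in for whitened strain data
corr_series = conv(x)           # matched-filter output, up to normalization
```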
    Reinforcement Learning to Solve NP-hard Problems: an Application to the CVRP. (arXiv:2201.05393v1 [cs.AI])
    (0 min) In this paper, we evaluate the use of Reinforcement Learning (RL) to solve a classic combinatorial optimization problem: the Capacitated Vehicle Routing Problem (CVRP). We formalize this problem in the RL framework and compare two of the most promising RL approaches with traditional solving techniques on a set of benchmark instances. We evaluate the different approaches by the quality of the returned solution and the time required to return it. We found that despite not returning the best solution, the RL approach has many advantages over traditional solvers. First, the versatility of the framework allows the resolution of more complex combinatorial problems. Moreover, instead of trying to solve a specific instance of the problem, the RL algorithm learns the skills required to solve the problem. The trained policy can then quasi-instantly provide a solution to an unseen problem without having to solve it from scratch. Finally, the use of trained models makes the RL solver by far the fastest, and therefore makes this approach better suited for commercial use, where the user experience is paramount. Techniques like knowledge transfer can also be used to improve the training efficiency of the algorithm and help solve bigger and more complex problems.
    Path Regularization: A Convexity and Sparsity Inducing Regularization for Parallel ReLU Networks. (arXiv:2110.09548v3 [cs.LG] UPDATED)
    (0 min) Understanding the fundamental principles behind the success of deep neural networks is one of the most important open questions in the current literature. To this end, we study the training problem of deep neural networks and introduce an analytic approach to unveil hidden convexity in the optimization landscape. We consider a deep parallel ReLU network architecture, which also includes standard deep networks and ResNets as its special cases. We then show that pathwise regularized training problems can be represented as an exact convex optimization problem. We further prove that the equivalent convex problem is regularized via a group sparsity inducing norm. Thus, a path regularized parallel ReLU network can be viewed as a parsimonious convex model in high dimensions. More importantly, we show that the computational complexity required to globally optimize the equivalent convex problem is fully polynomial-time in feature dimension and number of samples. Therefore, we prove polynomial-time trainability of path regularized ReLU networks with global optimality guarantees. We also provide several numerical experiments corroborating our theory.
    Reinforcement Learning in Time-Varying Systems: an Empirical Study. (arXiv:2201.05560v1 [cs.LG])
    (0 min) Recent research has turned to Reinforcement Learning (RL) to solve challenging decision problems, as an alternative to hand-tuned heuristics. RL can learn good policies without the need for modeling the environment's dynamics. Despite this promise, RL remains an impractical solution for many real-world systems problems. A particularly challenging case occurs when the environment changes over time, i.e. it exhibits non-stationarity. In this work, we characterize the challenges introduced by non-stationarity and develop a framework for addressing them to train RL agents in live systems. Such agents must explore and learn new environments, without hurting the system's performance, and remember them over time. To this end, our framework (1) identifies different environments encountered by the live system, (2) explores and trains a separate expert policy for each environment, and (3) employs safeguards to protect the system's performance. We apply our framework to two systems problems: straggler mitigation and adaptive video streaming, and evaluate it against a variety of alternative approaches using real-world and synthetic data. We show that each component of our framework is necessary to cope with non-stationarity.
    Unsupervised Temporal Video Grounding with Deep Semantic Clustering. (arXiv:2201.05307v1 [cs.CV])
    (0 min) Temporal video grounding (TVG) aims to localize a target segment in a video according to a given sentence query. Although previous works have made respectable achievements in this task, they rely heavily on abundant video-query paired data, which is expensive and time-consuming to collect in real-world scenarios. In this paper, we explore whether a video grounding model can be learned without any paired annotations. To the best of our knowledge, this is the first work to address TVG in an unsupervised setting. Since there is no paired supervision, we propose a novel Deep Semantic Clustering Network (DSCNet) to leverage the semantic information of the whole query set to compose the possible activity in each video for grounding. Specifically, we first develop a language semantic mining module, which extracts implicit semantic features from the whole query set. Then, these language semantic features serve as guidance to compose the activity in each video via a video-based semantic aggregation module. Finally, we utilize a foreground attention branch to filter out redundant background activities and refine the grounding results. To validate the effectiveness of DSCNet, we conduct experiments on both the ActivityNet Captions and Charades-STA datasets. The results demonstrate that DSCNet achieves competitive performance, and even outperforms most weakly-supervised approaches.
    A causal view on compositional data. (arXiv:2106.11234v2 [cs.LG] UPDATED)
    (0 min) Many scientific datasets are compositional in nature. Important examples include species abundances in ecology, rock compositions in geology, topic compositions in large-scale text corpora, and sequencing count data in molecular biology. Here, we provide a causal view on compositional data in an instrumental variable setting where the composition acts as the cause. First, we crisply articulate potential pitfalls for practitioners regarding the interpretation of compositional causes from the viewpoint of interventions and warn against attributing causal meaning to common summary statistics such as diversity indices. We then advocate for and develop multivariate methods using statistical data transformations and regression techniques that take the special structure of the compositional sample space into account. In a comparative analysis on synthetic and real data we show the advantages and limitations of our proposal. We posit that our framework provides a useful starting point and guidance for valid and informative cause-effect estimation in the context of compositional data.
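    The "special structure of the compositional sample space" is the simplex: components are non-negative and sum to one, so Euclidean operations on raw compositions are ill-posed. A standard family of transformations of the kind the abstract alludes to are Aitchison log-ratio maps; the centered log-ratio (clr) sketch below is illustrative, and the paper's exact choice of transformation may differ.

```python
import numpy as np

def clr(x, eps=1e-9):
    """Centered log-ratio transform: maps the simplex to a Euclidean subspace."""
    x = np.clip(x, eps, None)
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)

# A composition, e.g. relative species abundances summing to 1.
comp = np.array([0.5, 0.3, 0.15, 0.05])
z = clr(comp)
print(z, z.sum())   # clr coordinates sum to zero by construction
```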
    A Kernel-Expanded Stochastic Neural Network. (arXiv:2201.05319v1 [stat.ML])
    (0 min) Deep neural networks suffer from many fundamental issues in machine learning. For example, they often get trapped in local minima during training, and their prediction uncertainty is hard to assess. To address these issues, we propose the kernel-expanded stochastic neural network (K-StoNet) model, which incorporates support vector regression (SVR) as the first hidden layer and reformulates the neural network as a latent variable model. The former maps the input vector into an infinite-dimensional feature space via a radial basis function (RBF) kernel, ensuring the absence of local minima on the training loss surface. The latter breaks the high-dimensional nonconvex neural network training problem into a series of low-dimensional convex optimization problems, and makes the prediction uncertainty easy to assess. K-StoNet can be easily trained using the imputation-regularized optimization (IRO) algorithm. Compared to traditional deep neural networks, K-StoNet possesses a theoretical guarantee of asymptotic convergence to the global optimum and makes its prediction uncertainty easy to assess. The performance of the new model in training, prediction and uncertainty quantification is illustrated by simulated and real data examples.
    Examining and Mitigating the Impact of Crossbar Non-idealities for Accurate Implementation of Sparse Deep Neural Networks. (arXiv:2201.05229v1 [cs.LG])
    (0 min) Recently, several structured pruning techniques have been introduced for the energy-efficient implementation of Deep Neural Networks (DNNs) with fewer crossbars. Although these techniques claim to preserve the accuracy of sparse DNNs on crossbars, none has studied the impact of the inexorable crossbar non-idealities on the actual performance of the pruned networks. To this end, we perform a comprehensive study to show how highly sparse DNNs, which yield a significant crossbar compression rate, can suffer severe accuracy losses compared to unpruned DNNs mapped onto non-ideal crossbars. We perform experiments with multiple structured-pruning approaches (such as C/F pruning, XCS and XRS) on VGG11 and VGG16 DNNs with benchmark datasets (CIFAR10 and CIFAR100). We propose two mitigation approaches, crossbar column rearrangement and Weight-Constrained-Training (WCT), that can be integrated with the crossbar mapping of sparse DNNs to minimize the accuracy losses incurred by the pruned models. These help mitigate non-idealities by increasing the proportion of low-conductance synapses on crossbars, thereby improving their computational accuracy.
    Importance sampling based active learning for parametric seismic fragility curve estimation. (arXiv:2109.04323v2 [stat.ML] UPDATED)
    (0 min) The key elements of seismic probabilistic risk assessment studies are the fragility curves which express the probabilities of failure of structures conditional to a seismic intensity measure. A multitude of procedures is currently available to estimate these curves. For modeling-based approaches which may involve complex and expensive numerical models, the main challenge is to optimize the calls to the numerical codes to reduce the estimation costs. Adaptive techniques can be used for this purpose, but in doing so, taking into account the uncertainties of the estimates (via confidence intervals or ellipsoids related to the size of the samples used) is an arduous task because the samples are no longer independent and possibly not identically distributed. The main contribution of this work is to deal with this question in a mathematical and rigorous way. To this end, we propose and implement an active learning methodology based on adaptive importance sampling for parametric estimations of fragility curves. We prove some theoretical properties (consistency and asymptotic normality) for the estimator of interest. Moreover, we give a convergence criterion in order to use asymptotic confidence ellipsoids. Finally, the performances of the methodology are evaluated on analytical and industrial test cases of increasing complexity.
    SPLDExtraTrees: Robust machine learning approach for predicting kinase inhibitor resistance. (arXiv:2111.08008v4 [q-bio.QM] UPDATED)
    (0 min) Drug resistance is a major threat to global health and a significant concern throughout the clinical treatment of diseases and drug development. Mutations in proteins related to drug binding are a common cause of adaptive drug resistance. Therefore, quantitative estimates of how mutations affect the interaction between a drug and its target protein are of vital significance for drug development and clinical practice. Computational methods relying on molecular dynamics simulations, Rosetta protocols, and machine learning have proven capable of predicting ligand affinity changes upon protein mutation. However, the severely limited sample size and heavy noise induce overfitting and generalization issues that have impeded the wide adoption of machine learning for studying drug resistance. In this paper, we propose a robust machine learning method, termed SPLDExtraTrees, which can accurately predict ligand binding affinity changes upon protein mutation and identify resistance-causing mutations. In particular, the proposed method ranks training data following a specific scheme that starts with easy-to-learn samples and gradually incorporates harder and more diverse samples into the training, and then iterates between sample weight recalculation and model updates. In addition, we compute additional physics-based structural features to provide the machine learning model with valuable domain knowledge on proteins for these data-limited predictive tasks. The experiments substantiate the capability of the proposed method to predict kinase inhibitor resistance under three scenarios, achieving predictive accuracy comparable to that of molecular dynamics and Rosetta methods at much lower computational cost.
    Comparing Model-free and Model-based Algorithms for Offline Reinforcement Learning. (arXiv:2201.05433v1 [cs.LG])
    (0 min) Offline reinforcement learning (RL) algorithms are often designed with environments such as MuJoCo in mind, in which the planning horizon is extremely long and no noise exists. We compare model-free, model-based, as well as hybrid offline RL approaches on various industrial benchmark (IB) datasets to test the algorithms in settings closer to real-world problems, including complex noise and partially observable states. We find that on the IB, hybrid approaches face severe difficulties and that simpler algorithms, such as rollout-based algorithms or model-free algorithms with simpler regularizers, perform best on the datasets.
    Combining No-regret and Q-learning. (arXiv:1910.03094v3 [cs.LG] UPDATED)
    (0 min) Counterfactual Regret Minimization (CFR) has found success in settings like poker which have both terminal states and perfect recall. We seek to understand how to relax these requirements. As a first step, we introduce a simple algorithm, local no-regret learning (LONR), which uses a Q-learning-like update rule to allow learning without terminal states or perfect recall. We prove its convergence for the basic case of MDPs (and limited extensions of them) and present empirical results showing that it achieves last iterate convergence in a number of settings, most notably NoSDE games, a class of Markov games specifically designed to be challenging to learn where no prior algorithm is known to achieve convergence to a stationary equilibrium even on average.
    PDO-eS2CNNs: Partial Differential Operator Based Equivariant Spherical CNNs. (arXiv:2104.03584v2 [cs.CV] UPDATED)
    (0 min) Spherical signals exist in many applications, e.g., planetary data, LiDAR scans and the digitalization of 3D objects, calling for models that can process spherical data effectively. Simply projecting spherical data onto the 2D plane and then using planar convolutional neural networks (CNNs) does not perform well, because of the distortion introduced by the projection and the resulting loss of translation equivariance. Good principles for designing spherical CNNs are therefore to avoid distortion and to convert the shift equivariance of planar CNNs into rotation equivariance in the spherical domain. In this work, we use partial differential operators (PDOs) to design a spherical equivariant CNN, PDO-eS2CNN, which is exactly rotation equivariant in the continuous domain. We then discretize PDO-eS2CNNs and analyze the equivariance error resulting from discretization. This is the first time that the equivariance error has been theoretically analyzed in the spherical domain. In experiments, PDO-eS2CNNs show greater parameter efficiency and significantly outperform other spherical CNNs on several tasks.
    Generic Itemset Mining Based on Reinforcement Learning. (arXiv:2105.07753v2 [cs.DB] UPDATED)
    (0 min) One of the biggest problems in itemset mining is the need to develop a new data structure or algorithm every time a user wants to extract a different type of itemset. To overcome this, we propose a method called Generic Itemset Mining based on Reinforcement Learning (GIM-RL), which offers a unified framework to train an agent for extracting any type of itemset. In GIM-RL, the environment formulates iterative steps of extracting a target type of itemset from a dataset. At each step, an agent performs an action to add or remove an item to or from the current itemset, and then obtains from the environment a reward that represents how relevant the itemset resulting from the action is to the target type. Through numerous trial-and-error steps in which various rewards are obtained by diverse actions, the agent is trained to maximise cumulative rewards so that it acquires the optimal action policy for forming as many itemsets of the target type as possible. In this framework, an agent for extracting any type of itemset can be trained as long as a reward suitable for that type can be defined. Extensive experiments on mining high utility itemsets, frequent itemsets and association rules show the general effectiveness and one remarkable potential (agent transfer) of GIM-RL. We hope that GIM-RL opens a new research direction towards learning-based itemset mining.
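    The abstract fully specifies the interaction loop (state: the current itemset; action: toggle an item in or out; reward: relevance of the resulting itemset to the target type), so a toy environment is easy to sketch. The support-based reward below is an illustrative stand-in for whatever reward the paper defines for a given itemset type.

```python
class ItemsetEnv:
    """Toy GIM-RL-style environment: the agent toggles items in or out of the
    current itemset; the support-based reward below is illustrative only."""

    def __init__(self, transactions, n_items, min_support=0.3):
        self.transactions = transactions   # list of sets of item ids
        self.n_items = n_items             # size of the action space
        self.min_support = min_support
        self.state = set()                 # current itemset

    def support(self, itemset):
        """Fraction of transactions containing the itemset."""
        if not itemset:
            return 0.0
        return sum(itemset <= t for t in self.transactions) / len(self.transactions)

    def step(self, item):
        """Action = toggle `item`; reward = 1 for a frequent itemset, else its support."""
        self.state.symmetric_difference_update({item})
        s = self.support(self.state)
        reward = 1.0 if (self.state and s >= self.min_support) else s
        return frozenset(self.state), reward

env = ItemsetEnv([{0, 1, 2}, {0, 1}, {1, 2}, {0, 2}], n_items=3)
print(env.step(1))   # adds item 1; support 3/4 -> reward 1.0
```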
    HYLDA: End-to-end Hybrid Learning Domain Adaptation for LiDAR Semantic Segmentation. (arXiv:2201.05585v1 [cs.CV])
    (0 min) In this paper we address the problem of training a LiDAR semantic segmentation network using a fully-labeled source dataset and a target dataset that only has a small number of labels. To this end, we develop a novel image-to-image translation engine, and couple it with a LiDAR semantic segmentation network, resulting in an integrated domain adaptation architecture we call HYLDA. To train the system end-to-end, we adopt a diverse set of learning paradigms, including 1) self-supervision on a simple auxiliary reconstruction task, 2) semi-supervised training using a few available labeled target domain frames, and 3) unsupervised training on the fake translated images generated by the image-to-image translation stage, together with the labeled frames from the source domain. In the latter case, the semantic segmentation network participates in the updating of the image-to-image translation engine. We demonstrate experimentally that HYLDA effectively addresses the challenging problem of improving generalization on validation data from the target domain when only a few target labeled frames are available for training. We perform an extensive evaluation where we compare HYLDA against strong baseline methods using two publicly available LiDAR semantic segmentation datasets.
    How I learned to stop worrying and love the curse of dimensionality: an appraisal of cluster validation in high-dimensional spaces. (arXiv:2201.05214v1 [cs.LG])
    (0 min) The failure of the Euclidean norm to reliably distinguish between nearby and distant points in high-dimensional space is well known. This phenomenon of distance concentration manifests in a variety of data distributions, with iid or correlated features, including centrally-distributed and clustered data. Unsupervised learning based on Euclidean nearest neighbors, and more general proximity-oriented data mining tasks like clustering, might therefore be adversely affected by distance concentration in high-dimensional applications. While considerable work has been done developing clustering algorithms with reliable high-dimensional performance, the problem of cluster validation--of determining the natural number of clusters in a dataset--has not been carefully examined in high-dimensional problems. In this work we investigate how the sensitivities of common Euclidean norm-based cluster validity indices scale with dimension for a variety of synthetic data schemes, including well-separated and noisy clusters, and find that the overwhelming majority of indices have improved or stable sensitivity in high dimensions. The curse of dimensionality is therefore dispelled for this class of fairly generic data schemes.
    Voice2Series: Reprogramming Acoustic Models for Time Series Classification. (arXiv:2106.09296v3 [cs.LG] UPDATED)
    (2 min) Learning to classify time series with limited data is a practical yet challenging problem. Current methods are primarily based on hand-designed feature extraction rules or domain-specific data augmentation. Motivated by advances in deep speech processing models and the fact that voice data are univariate temporal signals, in this paper we propose Voice2Series (V2S), a novel end-to-end approach that reprograms acoustic models for time series classification through input transformation learning and output label mapping. Leveraging the representation learning power of a large-scale pre-trained speech processing model, we show that V2S achieves competitive results on 19 of 30 different time series classification tasks. We further provide a theoretical justification of V2S by proving that its population risk is upper bounded by the source risk and a Wasserstein distance accounting for feature alignment via reprogramming. Our results offer new and effective means for time series classification.
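    Schematically, reprogramming wraps a frozen pretrained model with two small pieces: a trainable input transformation that embeds the time series into the model's input space, and a fixed many-to-one map from source (speech) labels to target classes. The PyTorch sketch below is our reading of that recipe; the stand-in acoustic model, shapes, and label map are placeholders, not the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Reprogram(nn.Module):
    """Schematic V2S-style reprogramming (frozen model and shapes are placeholders)."""

    def __init__(self, frozen_model, series_len, audio_len, n_src, n_tgt):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(audio_len))   # trainable input transform
        self.pad = audio_len - series_len
        self.n_tgt = n_tgt
        self.frozen = frozen_model
        for p in self.frozen.parameters():
            p.requires_grad_(False)                         # source model stays fixed
        # Fixed many-to-one mapping from source (speech) labels to target classes.
        self.register_buffer("label_map", torch.arange(n_src) % n_tgt)

    def forward(self, x):                                   # x: (batch, series_len)
        z = F.pad(x, (0, self.pad)) + self.delta            # embed series in audio space
        src_logits = self.frozen(z)                         # (batch, n_src)
        tgt_logits = torch.zeros(x.size(0), self.n_tgt, device=x.device)
        tgt_logits.index_add_(1, self.label_map, src_logits)  # aggregate mapped labels
        return tgt_logits

frozen = nn.Linear(128, 35)   # toy stand-in for a pretrained acoustic model
model = Reprogram(frozen, series_len=96, audio_len=128, n_src=35, n_tgt=5)
logits = model(torch.randn(4, 96))   # (4, 5); only `delta` receives gradients
```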
    Parallel Neural Local Lossless Compression. (arXiv:2201.05213v1 [eess.IV])
    (2 min) The recently proposed Neural Local Lossless Compression (NeLLoC), which is based on a local autoregressive model, has achieved state-of-the-art (SOTA) out-of-distribution (OOD) generalization performance in the image compression task. In addition to the encouragement of OOD generalization, the local model also allows parallel inference in the decoding stage. In this paper, we propose a parallelization scheme for local autoregressive models. We discuss the practicalities of implementing this scheme, and provide experimental evidence of significant gains in compression runtime compared to the previous, non-parallel implementation.
    Adaptive Transfer Learning for Plant Phenotyping. (arXiv:2201.05261v1 [cs.LG])
    (2 min) Plant phenotyping (Guo et al. 2021; Pieruschka et al. 2019) focuses on studying the diverse traits of plants related to plant growth. More specifically, by accurately measuring a plant's anatomical, ontogenetical, physiological and biochemical properties, it allows the crucial factors of plant growth in different environments to be identified. One commonly used approach is to predict a plant's traits using hyperspectral reflectance (Yendrek et al. 2017; Wang et al. 2021). However, the data distributions of hyperspectral reflectance data in plant phenotyping may vary across environments and plants. As a result, it would be computationally expensive to learn machine learning models separately for one plant in different environments. To solve this problem, we study the knowledge transferability of modern machine learning models in plant phenotyping. More specifically, this work aims to answer the following questions. (1) How is the performance of conventional machine learning models, e.g., partial least squares regression (PLSR), Gaussian process regression (GPR) and multi-layer perceptron (MLP), affected by the number of annotated samples for plant phenotyping? (2) Can neural network based transfer learning models improve the performance of plant phenotyping? (3) Can neural network based transfer learning be further improved by using infinite-width hidden layers for plant phenotyping?
    A Survey of Knowledge-Enhanced Text Generation. (arXiv:2010.04389v3 [cs.CL] UPDATED)
    (2 min) The goal of text generation is to make machines express themselves in human language. It is one of the most important yet challenging tasks in natural language processing (NLP). Since 2014, various neural encoder-decoder models pioneered by Seq2Seq have been proposed to achieve this goal by learning to map input text to output text. However, the input text alone often provides limited knowledge for generating the desired output, so the performance of text generation is still far from satisfactory in many real-world scenarios. To address this issue, researchers have considered incorporating various forms of knowledge beyond the input text into the generation models. This research direction is known as knowledge-enhanced text generation. In this survey, we present a comprehensive review of research on knowledge-enhanced text generation over the past five years. The main content consists of two parts: (i) general methods and architectures for integrating knowledge into text generation; (ii) specific techniques and applications according to different forms of knowledge data. This survey is intended for a broad audience of researchers and practitioners in academia and industry.
    De Rham compatible Deep Neural Networks. (arXiv:2201.05395v1 [math.NA])
    (2 min) We construct several classes of neural networks with ReLU and BiSU (Binary Step Unit) activations, which exactly emulate the lowest order Finite Element (FE) spaces on regular, simplicial partitions of polygonal and polyhedral domains $\Omega \subset \mathbb{R}^d$, $d=2,3$. For continuous, piecewise linear (CPwL) functions, our constructions generalize previous results in that arbitrary, regular simplicial partitions of $\Omega$ are admitted, also in arbitrary dimension $d\geq 2$. Vector-valued elements emulated include the classical Raviart-Thomas and the first family of N\'{e}d\'{e}lec edge elements on triangles and tetrahedra. Neural Networks emulating these FE spaces are required in the correct approximation of boundary value problems of electromagnetism in nonconvex polyhedra $\Omega \subset \mathbb{R}^3$, thereby constituting an essential ingredient in the application of e.g. the methodology of ``physics-informed NNs'' or ``deep Ritz methods'' to electromagnetic field simulation via deep learning techniques. They satisfy exact (De Rham) sequence properties, and also spawn discrete boundary complexes on $\partial\Omega$ which satisfy exact sequence properties for the surface divergence and curl operators $\mathrm{div}_\Gamma$ and $\mathrm{curl}_\Gamma$, respectively, thereby enabling ``neural boundary elements'' for computational electromagnetism. We indicate generalizations of our constructions to higher-order compatible spaces and other, non-compatible classes of discretizations in particular the Crouzeix-Raviart elements and Hybridized, Higher Order (HHO) methods.
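    For orientation, the exact sequence in question is the classical 3-D de Rham complex (exact on contractible domains), with CPwL elements conforming in $H^1$, Nédélec edge elements in $H(\mathrm{curl})$, Raviart-Thomas elements in $H(\mathrm{div})$, and piecewise constants in $L^2$:

    $$ H^1(\Omega) \;\xrightarrow{\;\nabla\;}\; H(\mathrm{curl},\Omega) \;\xrightarrow{\;\nabla\times\;}\; H(\mathrm{div},\Omega) \;\xrightarrow{\;\nabla\cdot\;}\; L^2(\Omega). $$

    Exactness means the range of each operator equals the kernel of the next, which is the property the emulating networks are required to reproduce at the discrete level.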
    DapStep: Deep Assignee Prediction for Stack Trace Error rePresentation. (arXiv:2201.05256v1 [cs.SE])
    (2 min) The task of finding the best developer to fix a bug is called bug triage. Most existing approaches treat the bug triage task as a classification problem; however, classification is not appropriate when the sets of classes change over time (as developers often do in a project). Furthermore, to the best of our knowledge, all existing models use textual sources of information, i.e., bug descriptions, which are not always available. In this work, we explore the applicability of existing solutions for the bug triage problem when stack traces are used as the main data source of bug reports. Additionally, we reformulate this task as a ranking problem and propose new deep learning models to solve it. The models are based on a bidirectional recurrent neural network with attention and on a convolutional neural network, with the weights of the models optimized using a ranking loss function. To improve the quality of ranking, we propose using additional information from version control system annotations. Two approaches are proposed for extracting features from annotations: manual and using an additional neural network. To evaluate our models, we collected two datasets of real-world stack traces. Our experiments show that the proposed models outperform existing models adapted to handle stack traces. To facilitate further research in this area, we publish the source code of our models and one of the collected datasets.
    Position-enhanced and Time-aware Graph Convolutional Network for Sequential Recommendations. (arXiv:2107.05235v2 [cs.IR] UPDATED)
    (2 min) Most of the existing deep learning-based sequential recommendation approaches utilize the recurrent neural network architecture or self-attention to model the sequential patterns and temporal influence among a user's historical behavior and learn the user's preference at a specific time. However, these methods have two main drawbacks. First, they focus on modeling users' dynamic states from a user-centric perspective and always neglect the dynamics of items over time. Second, most of them deal with only the first-order user-item interactions and do not consider the high-order connectivity between users and items, which has recently been proved helpful for the sequential recommendation. To address the above problems, in this article, we attempt to model user-item interactions by a bipartite graph structure and propose a new recommendation approach based on a Position-enhanced and Time-aware Graph Convolutional Network (PTGCN) for the sequential recommendation. PTGCN models the sequential patterns and temporal dynamics between user-item interactions by defining a position-enhanced and time-aware graph convolution operation and learning the dynamic representations of users and items simultaneously on the bipartite graph with a self-attention aggregator. Also, it realizes the high-order connectivity between users and items by stacking multi-layer graph convolutions. To demonstrate the effectiveness of PTGCN, we carried out a comprehensive evaluation of PTGCN on three real-world datasets of different sizes compared with a few competitive baselines. Experimental results indicate that PTGCN outperforms several state-of-the-art models in terms of two commonly-used evaluation metrics for ranking.
    Training Free Graph Neural Networks for Graph Matching. (arXiv:2201.05349v1 [cs.LG])
    (2 min) We present TFGM (Training Free Graph Matching), a framework to boost the performance of Graph Neural Networks (GNNs) based graph matching without training. TFGM sidesteps two crucial problems when training GNNs: 1) the limited supervision due to expensive annotation, and 2) training's computational cost. A basic framework, BasicTFGM, is first proposed by adopting the inference stage of graph matching methods. Our analysis shows that the BasicTFGM is a linear relaxation to the quadratic assignment formulation of graph matching. This guarantees the preservation of structure compatibility and an efficient polynomial complexity. Empirically, we further improve the BasicTFGM by handcrafting two types of matching priors into the architecture of GNNs: comparing node neighborhoods of different localities and utilizing annotation data if available. For evaluation, we conduct extensive experiments on a broad set of settings, including supervised keypoint matching between images, semi-supervised entity alignment between knowledge graphs, and unsupervised alignment between protein interaction networks. Applying TFGM on various GNNs shows promising improvements over baselines. Further ablation studies demonstrate the effective and efficient training-free property of TFGM. Our code is available at https://github.com/acharkq/Training-Free-Graph-Matching.
    A causal model of safety assurance for machine learning. (arXiv:2201.05451v1 [cs.SE])
    (2 min) This paper proposes a framework based on a causal model of safety upon which effective safety assurance cases for ML-based applications can be built. In doing so, we build upon established principles of safety engineering as well as previous work on structuring assurance arguments for ML. The paper defines four categories of safety case evidence and a structured analysis approach within which these pieces of evidence can be effectively combined. Where appropriate, abstract formalisations of these contributions are used to illustrate the causalities they evaluate, their contributions to the safety argument, and desirable properties of the evidence. Based on the proposed framework, progress in this area is re-evaluated and a set of future research directions is proposed so that tangible progress in this field can be made.
    Leveraging Intrinsic Gradient Information for Further Training of Differentiable Machine Learning Models. (arXiv:2112.00094v2 [cs.LG] UPDATED)
    (2 min) Designing models that produce accurate predictions is the fundamental objective of machine learning (ML). This work presents methods demonstrating that when the derivatives of target variables (outputs) with respect to inputs can be extracted from processes of interest, e.g., neural network (NN) based surrogate models, they can be leveraged to further improve the accuracy of differentiable ML models. This paper generalises the idea and provides practical methodologies that can be used to leverage gradient information (GI) across a variety of applications, including: (1) improving the performance of generative adversarial networks (GANs); (2) efficiently tuning NN model complexity; (3) regularising linear regressions. Numerical results show that GI can effectively enhance ML models with existing datasets, demonstrating its value for a variety of applications.
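    One concrete instance of leveraging such gradient information, sketched here under our own assumptions rather than as the paper's exact method, is to add the mismatch between the model's input gradients and the known target derivatives to the loss; with PyTorch autograd this is a few lines:

```python
import torch

# Toy data where the target derivative dy/dx = cos(x) is known analytically.
x = torch.linspace(-3, 3, 256).unsqueeze(1).requires_grad_(True)
y, dy = torch.sin(x).detach(), torch.cos(x).detach()

model = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    pred = model(x)
    # d pred / d x, kept in the graph so it can be penalized in the loss.
    grad = torch.autograd.grad(pred.sum(), x, create_graph=True)[0]
    loss = torch.mean((pred - y) ** 2) + torch.mean((grad - dy) ** 2)  # value + gradient fit
    opt.zero_grad()
    loss.backward()
    opt.step()
```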
    LeapfrogLayers: A Trainable Framework for Effective Topological Sampling. (arXiv:2112.01582v2 [hep-lat] UPDATED)
    (2 min) We introduce LeapfrogLayers, an invertible neural network architecture that can be trained to efficiently sample the topology of a 2D $U(1)$ lattice gauge theory. We show an improvement in the integrated autocorrelation time of the topological charge when compared with traditional HMC, and look at how different quantities transform under our model. Our implementation is open source, and is publicly available on github at https://github.com/saforem2/l2hmc-qcd.
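    For context, the classical leapfrog (Störmer-Verlet) update that the trainable layers generalize alternates half-step momentum and full-step position updates; LeapfrogLayers replaces parts of this update with invertible networks, which the plain numpy sketch below does not include.

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, n_steps):
    """Plain leapfrog integrator that LeapfrogLayers makes trainable."""
    p = p - 0.5 * eps * grad_U(q)      # half-step momentum
    for _ in range(n_steps - 1):
        q = q + eps * p                # full-step position
        p = p - eps * grad_U(q)        # full-step momentum
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)      # final half-step momentum
    return q, p

# Example: harmonic potential U(q) = q^2 / 2, so grad_U(q) = q.
q, p = leapfrog(np.array([1.0]), np.array([0.0]), lambda q: q, eps=0.1, n_steps=10)
```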
    Cross-token Modeling with Conditional Computation. (arXiv:2109.02008v3 [cs.LG] UPDATED)
    (2 min) Mixture-of-Experts (MoE), a conditional computation architecture, has achieved promising performance by scaling the local module (i.e., the feed-forward network) of the transformer. However, scaling the cross-token module (i.e., self-attention) is challenging due to unstable training. This work proposes Sparse-MLP, an all-MLP model that applies sparsely-activated MLPs to cross-token modeling. Specifically, in each Sparse block of our all-MLP model, we apply two stages of MoE layers: one with MLP experts mixing information within channels along the image patch dimension, the other with MLP experts mixing information within patches along the channel dimension. In addition, by proposing an importance-score routing strategy for MoE and redesigning the image representation shape, we further improve our model's computational efficiency. Experimentally, our models are more computation-efficient than Vision Transformers at comparable accuracy. Our models can also outperform MLP-Mixer by 2.5\% on ImageNet Top-1 accuracy with fewer parameters and lower computational cost. On downstream tasks, i.e., CIFAR-10 and CIFAR-100, our models still achieve better performance than the baselines.
    Estimating Gaussian Copulas with Missing Data. (arXiv:2201.05565v1 [stat.ML])
    (2 min) In this work we present a rigorous application of the Expectation Maximization algorithm to determine the marginal distributions and the dependence structure in a Gaussian copula model with missing data. We further show how to circumvent a priori assumptions on the marginals with semiparametric modelling. The joint distribution learned through this algorithm is considerably closer to the underlying distribution than existing methods.
    SympOCnet: Solving optimal control problems with applications to high-dimensional multi-agent path planning problems. (arXiv:2201.05475v1 [math.OC])
    (2 min) Solving high-dimensional optimal control problems in real-time is an important but challenging problem, with applications to multi-agent path planning problems, which have drawn increased attention given the growing popularity of drones in recent years. In this paper, we propose a novel neural network method called SympOCnet that applies the Symplectic network to solve high-dimensional optimal control problems with state constraints. We present several numerical results on path planning problems in two-dimensional and three-dimensional spaces. Specifically, we demonstrate that our SympOCnet can solve a problem with more than 500 dimensions in 1.5 hours on a single GPU, which shows the effectiveness and efficiency of SympOCnet. The proposed method is scalable and has the potential to solve truly high-dimensional path planning problems in real-time.
    Manifoldron: Direct Space Partition via Manifold Discovery. (arXiv:2201.05279v1 [cs.LG])
    (2 min) A neural network with the widely-used ReLU activation has been shown to partition the sample space into many convex polytopes for prediction. However, the parameterized way in which a neural network and other machine learning models partition the space has imperfections, e.g., compromised interpretability for complex models, inflexibility in decision boundary construction due to the generic character of the model, and the risk of being trapped in shortcut solutions. In contrast, although non-parameterized models can admirably avoid or downplay these issues, they are usually insufficiently powerful, either due to over-simplification or the failure to accommodate the manifold structure of data. In this context, we first propose a new type of machine learning model, referred to as the Manifoldron, that directly derives decision boundaries from data and partitions the space via manifold structure discovery. We then systematically analyze the key characteristics of the Manifoldron, including interpretability, manifold characterization capability, and its link to neural networks. Experimental results on 9 small and 11 large datasets demonstrate that the proposed Manifoldron performs competitively compared to mainstream machine learning models. Our code is available at https://github.com/wdayang/Manifoldron for free download and evaluation.
    N-Cloth: Predicting 3D Cloth Deformation with Mesh-Based Networks. (arXiv:2112.06397v2 [cs.GR] UPDATED)
    (2 min) We present a novel mesh-based learning approach (N-Cloth) for plausible 3D cloth deformation prediction. Our approach is general and can handle cloth or obstacles represented by triangle meshes with arbitrary topologies. We use graph convolution to transform the cloth and object meshes into a latent space to reduce the non-linearity in the mesh space. Our network can predict the target 3D cloth mesh deformation based on the initial state of the cloth mesh template and the target obstacle mesh. Our approach can handle complex cloth meshes with up to 100K triangles and scenes with various objects corresponding to SMPL humans, non-SMPL humans or rigid bodies. In practice, our approach can be used to generate plausible cloth simulation at 30-45 fps on an NVIDIA GeForce RTX 3090 GPU. We highlight its benefits over prior learning-based methods and physically-based cloth simulators.
    Multi-head Temporal Attention-Augmented Bilinear Network for Financial time series prediction. (arXiv:2201.05459v1 [cs.LG])
    (2 min) Financial time-series forecasting is one of the most challenging domains in the field of time-series analysis. This is mostly due to the highly non-stationary and noisy nature of financial time-series data. With progressive efforts of the community to design specialized neural networks incorporating prior domain knowledge, many financial analysis and forecasting problems have been successfully tackled. The temporal attention mechanism is a neural layer design that recently gained popularity due to its ability to focus on important temporal events. In this paper, we propose a neural layer based on the ideas of temporal attention and multi-head attention to extend the capability of the underlying neural network to focus simultaneously on multiple temporal instances. The effectiveness of our approach is validated using large-scale limit-order book market data to forecast the direction of mid-price movements. Our experiments show that the use of multi-head temporal attention modules leads to enhanced prediction performance compared to baseline models.
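    As background for the proposed layer, a single temporal attention head over a feature sequence is just a softmax-weighted mixture over time steps, and a multi-head variant runs several heads in parallel and concatenates them. The numpy sketch below shows the generic mechanism only; the authors' attention-augmented bilinear layer differs in its parametrization.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(X, Wq, Wk, Wv):
    """X: (T, d) feature sequence; one attention head over the time axis."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (T, T) weights over time steps
    return A @ V

rng = np.random.default_rng(0)
T, d = 50, 16
X = rng.standard_normal((T, d))                   # e.g. limit-order book features
heads = [temporal_attention(X, *rng.standard_normal((3, d, d))) for _ in range(4)]
out = np.concatenate(heads, axis=-1)              # multi-head: concatenate head outputs
```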
    Attention-Based Recommendation On Graphs. (arXiv:2201.05499v1 [cs.IR])
    (2 min) Graph Neural Networks (GNNs) have shown remarkable performance in different tasks. However, there are only a few studies of GNNs for recommender systems. GCNs, as a type of GNN, can extract high-quality embeddings for the different entities in a graph. In a collaborative filtering task, the core problem is to determine how informative an entity is for predicting the future behavior of a target user. Using an attention mechanism, we can enable GCNs to perform such an analysis when the underlying data is modeled as a graph. In this study, we propose GARec as a model-based recommender system that applies an attention mechanism along with a spatial GCN on a recommender graph to extract embeddings for users and items. The attention mechanism tells the GCN how much a related user or item should affect the final representation of the target entity. We compare the performance of GARec against baseline algorithms in terms of RMSE. The presented method outperforms existing model-based, non-graph neural network and graph neural network approaches on different MovieLens datasets.
    Machine Learning: The Basics. (arXiv:1805.05052v16 [cs.LG] UPDATED)
    (3 min) Machine learning (ML) has become a commodity in our every-day lives. We routinely ask ML empowered smartphones to suggest lovely food places or to guide us through a strange place. ML methods have also become standard tools in many fields of science and engineering. A plethora of ML applications transform human lives at unprecedented pace and scale. This book portrays ML as the combination of three basic components: data, model and loss. ML methods combine these three components within computationally efficient implementations of the basic scientific principle "trial and error". This principle consists of the continuous adaptation of a hypothesis about a phenomenon that generates data. ML methods use a hypothesis to compute predictions for future events. We believe that thinking about ML as combinations of three components given by data, model, and loss helps to navigate the steadily growing offer for ready-to-use ML methods. Our three-component picture of ML allows a unified treatment of a wide range of concepts and techniques which seem quite unrelated at first sight. The regularization effect of early stopping in iterative methods is due to the shrinking of the effective hypothesis space. Privacy-preserving ML is obtained by particular choices for the features of data points. Explainable ML methods are characterized by particular choices for the hypothesis space. To make good use of ML tools it is instrumental to understand its underlying principles at different levels of detail. On a lower level, this tutorial helps ML engineers to choose suitable methods for the application at hand. The book also offers a higher-level view on the implementation of ML methods which is typically required to manage a team of ML engineers and data scientists.
    Eikonal depth: an optimal control approach to statistical depths. (arXiv:2201.05274v1 [math.ST])
    (2 min) Statistical depths provide a fundamental generalization of quantiles and medians to data in higher dimensions. This paper proposes a new type of globally defined statistical depth, based upon control theory and eikonal equations, which measures the smallest amount of probability density that has to be passed through along a path from a point to points outside the support of the distribution: for example, spatial infinity. This depth is easy to interpret and compute, expressively captures multi-modal behavior, and extends naturally to data that is non-Euclidean. We prove various properties of this depth and discuss computational considerations. In particular, we demonstrate that this notion of depth is robust under an approximate isometrically constrained adversarial model, a property which is not enjoyed by the Tukey depth. Finally, we give some illustrative examples in the context of two-dimensional mixture models and MNIST.
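    In symbols (our notation, reconstructed from the abstract), the depth of a point $x$ is the smallest amount of density a path must traverse to escape the support:

    $$ D(x) \;=\; \inf_{\gamma:\ \gamma(0)=x,\ \gamma(1)\notin\operatorname{supp}(\rho)} \int_0^1 \rho(\gamma(t))\,\|\dot\gamma(t)\|\,\mathrm{d}t, $$

    and such weighted minimal-path values satisfy an eikonal equation of the form $\|\nabla D\| = \rho$ with $D = 0$ outside the support, which connects the depth to standard fast-marching-type solvers.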
    Recombinator-k-means: An evolutionary algorithm that exploits k-means++ for recombination. (arXiv:1905.00531v5 [cs.LG] UPDATED)
    (2 min) We introduce an evolutionary algorithm called recombinator-$k$-means for optimizing the highly non-convex $k$-means problem. Its defining feature is that its crossover step involves all the members of the current generation, stochastically recombining them with a repurposed variant of the $k$-means++ seeding algorithm. The recombination also uses a reweighting mechanism that realizes a progressively sharper stochastic selection policy and ensures that the population eventually coalesces into a single solution. We compare this scheme with a state-of-the-art alternative, a more standard genetic algorithm with deterministic pairwise-nearest-neighbor crossover and an elitist selection policy, of which we also provide an augmented and efficient implementation. Extensive tests on large and challenging datasets (both synthetic and real-world) show that for fixed population sizes recombinator-$k$-means is generally superior in terms of the optimization objective, at the cost of a more expensive crossover step. When adjusting the population sizes of the two algorithms to match their running times, we find that for short times the (augmented) pairwise-nearest-neighbor method is always superior, while at longer times recombinator-$k$-means will match it and, on the most difficult examples, take over. We conclude that the reweighted whole-population recombination is more costly, but generally better at escaping local minima. Moreover, it is algorithmically simpler and more general (it could be applied even to $k$-medians or $k$-medoids, for example). Our implementations are publicly available at \href{https://github.com/carlobaldassi/RecombinatorKMeans.jl}{https://github.com/carlobaldassi/RecombinatorKMeans.jl}.
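    The seeding rule being repurposed is classical $k$-means++: the first center is drawn uniformly, and each subsequent center is drawn with probability proportional to its squared distance from the nearest center chosen so far (in the crossover, candidates come from the pooled population rather than the raw data). A plain numpy version of the base rule:

```python
import numpy as np

def kmeanspp_seed(X, k, rng):
    """Classical k-means++ seeding: D^2-weighted sampling of centers."""
    centers = [X[rng.integers(len(X))]]              # first center: uniform
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        p = d2 / d2.sum()                            # squared-distance weights
        centers.append(X[rng.choice(len(X), p=p)])
    return np.array(centers)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, (100, 2)) for m in (0, 3, 6)])
C = kmeanspp_seed(X, k=3, rng=rng)   # one center lands near each blob w.h.p.
```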
    On the Estimation Bias in Double Q-Learning. (arXiv:2109.14419v3 [cs.LG] UPDATED)
    (2 min) Double Q-learning is a classical method for reducing overestimation bias, which is caused by taking the maximum estimated value in the Bellman operation. Its variants in the deep Q-learning paradigm have shown great promise in producing reliable value prediction and improving learning performance. However, as shown by prior work, double Q-learning is not fully unbiased and suffers from underestimation bias. In this paper, we show that such underestimation bias may lead to multiple non-optimal fixed points under an approximate Bellman operator. To address the concern of converging to non-optimal stationary solutions, we propose a simple but effective approach as a partial fix for the underestimation bias in double Q-learning. This approach leverages approximate dynamic programming to bound the target value. We extensively evaluate our proposed method on the Atari benchmark tasks and demonstrate its significant improvement over baseline algorithms.
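    For reference, the tabular double Q-learning update maintains two estimators and decouples action selection from action evaluation, which is where the underestimation analyzed here originates:

    $$ Q_A(s,a) \;\leftarrow\; Q_A(s,a) + \alpha\Big[r + \gamma\, Q_B\big(s',\ \operatorname*{arg\,max}_{a'} Q_A(s',a')\big) - Q_A(s,a)\Big], $$

    with the roles of $Q_A$ and $Q_B$ swapped at random on each step.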
    Contrastive Laplacian Eigenmaps. (arXiv:2201.05493v1 [cs.LG])
    (2 min) Graph contrastive learning attracts/disperses node representations for similar/dissimilar node pairs under some notion of similarity. It may be combined with a low-dimensional embedding of nodes to preserve intrinsic and structural properties of a graph. In this paper, we extend the celebrated Laplacian Eigenmaps with contrastive learning, and call them COntrastive Laplacian EigenmapS (COLES). Starting from a GAN-inspired contrastive formulation, we show that the Jensen-Shannon divergence underlying many contrastive graph embedding models fails under disjoint positive and negative distributions, which may naturally emerge during sampling in the contrastive setting. In contrast, we demonstrate analytically that COLES essentially minimizes a surrogate of Wasserstein distance, which is known to cope well under disjoint distributions. Moreover, we show that the loss of COLES belongs to the family of so-called block-contrastive losses, previously shown to be superior compared to pair-wise losses typically used by contrastive methods. We show on popular benchmarks/backbones that COLES offers favourable accuracy/scalability compared to DeepWalk, GCN, Graph2Gauss, DGI and GRACE baselines.
    Structure Enhanced Graph Neural Networks for Link Prediction. (arXiv:2201.05293v1 [cs.LG])
    (2 min) Graph Neural Networks (GNNs) have shown promising results in various tasks, among which link prediction is an important one. GNN models usually follow a node-centric message passing procedure that aggregates neighborhood information to the central node recursively. Following this paradigm, features of nodes are passed through edges without regard to where the nodes are located or which roles they play. However, the neglected topological information has been shown to be valuable for link prediction tasks. In this paper, we propose Structure Enhanced Graph neural network (SEG) for link prediction. SEG introduces a path labeling method to capture the surrounding topological information of target nodes and then incorporates the structure into an ordinary GNN model. By jointly training the structure encoder and deep GNN model, SEG fuses topological structures and node features to take full advantage of graph information. Experiments on the OGB link prediction datasets demonstrate that SEG achieves state-of-the-art results on all three public datasets.
    The Fairness Field Guide: Perspectives from Social and Formal Sciences. (arXiv:2201.05216v1 [cs.AI])
    (2 min) Over the past several years, a slew of different methods to measure the fairness of a machine learning model have been proposed. However, despite the growing number of publications and implementations, there is still a critical lack of literature that explains the interplay of fair machine learning with the social sciences of philosophy, sociology, and law. We hope to remedy this issue by accumulating and expounding upon the thoughts and discussions of fair machine learning produced by both social and formal (specifically machine learning and statistics) sciences in this field guide. Specifically, in addition to giving the mathematical and algorithmic backgrounds of several popular statistical and causal-based fair machine learning methods, we explain the underlying philosophical and legal thoughts that support them. Further, we explore several criticisms of the current approaches to fair machine learning from sociological and philosophical viewpoints. It is our hope that this field guide will help fair machine learning practitioners better understand how their algorithms align with important humanistic values (such as fairness) and how we can, as a field, design methods and metrics to better serve oppressed and marginalized populaces.
  • stat.ML updates on arXiv.org

    A Method for Controlling Extrapolation when Visualizing and Optimizing the Prediction Profiles of Statistical and Machine Learning Models. (arXiv:2201.05236v1 [stat.ML])
    (2 min) We present a novel method for controlling extrapolation in the prediction profiler in the JMP software. The prediction profiler is a graphical tool for exploring high-dimensional prediction surfaces for statistical and machine learning models. The profiler contains interactive cross-sectional views, or profile traces, of the prediction surface of a model. Our method helps users avoid exploring predictions that should be considered extrapolation. It also performs optimization over a constrained factor region that avoids extrapolation, using a genetic algorithm. In simulations and real-world examples, we demonstrate how unconstrained optimal factor settings in the profiler frequently involve extrapolation, and how extrapolation control helps avoid these solutions, whose invalid factor settings may not be useful to the user.
    The Implicit Regularization of Momentum Gradient Descent with Early Stopping. (arXiv:2201.05405v1 [cs.LG])
    (0 min) The study of the implicit regularization induced by gradient-based optimization is a longstanding pursuit. In the present paper, we characterize the implicit regularization of momentum gradient descent (MGD) with early stopping by comparing it with explicit $\ell_2$-regularization (ridge). In detail, we study MGD in the continuous-time view, the so-called momentum gradient flow (MGF), and show that its tendency is closer to ridge than that of gradient descent (GD) [Ali et al., 2019] for least squares regression. Moreover, we prove that, under the calibration $t=\sqrt{2/\lambda}$, where $t$ is the time parameter in MGF and $\lambda$ is the tuning parameter in ridge regression, the risk of MGF is no more than 1.54 times that of ridge. In particular, the relative Bayes risk of MGF to ridge is between 1 and 1.035 under the optimal tuning. Numerical experiments strongly support our theoretical results.
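    To fix notation (the normalization of the loss is ours and may differ from the paper's): for least squares, the explicit comparator and the calibration read

    $$ \hat\beta_{\mathrm{ridge}}(\lambda) \;=\; \operatorname*{arg\,min}_{\beta}\ \tfrac{1}{2n}\,\|y - X\beta\|_2^2 + \tfrac{\lambda}{2}\,\|\beta\|_2^2, \qquad t = \sqrt{2/\lambda}, $$

    so early-stopped MGF at time $t$ is matched against ridge with tuning parameter $\lambda = 2/t^2$.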
    On Preference Learning Based on Sequential Bayesian Optimization with Pairwise Comparison. (arXiv:2103.13192v2 [cs.LG] UPDATED)
    (0 min) User preference learning is generally a hard problem. Individual preferences are typically unknown even to users themselves, while the space of choices is infinite. Here we study user preference learning from an information-theoretic perspective. We model preference learning as a system of two interacting sub-systems, one representing a user with his/her preferences and another representing an agent that has to learn these preferences. The user and his/her behaviour are modeled by a parametric preference function. To learn the preferences efficiently and reduce the search space quickly, we propose an agent that interacts with the user to collect the most informative data for learning. The agent presents two proposals to the user for evaluation, and the user rates them based on his/her preference function. We show that the optimal agent strategy for data collection and preference learning results from maximin optimization of the normalized weighted Kullback-Leibler (KL) divergence between the true and the agent-assigned predictive user response distributions. The resulting value of the KL-divergence, which we call the remaining system uncertainty (RSU), provides an efficient performance metric in the absence of ground truth. This metric characterises how well the agent can predict the user and, thus, the quality of the underlying learned user (preference) model. Our proposed agent comprises sequential mechanisms for user model inference and proposal generation. To infer the user model (preference function), Bayesian approximate inference is used in the agent. The data collection strategy is to generate the proposals whose responses most help to resolve the uncertainty associated with prediction of the user's responses. The efficiency of our approach is validated by numerical simulations.
    Generalization Error Bounds for Iterative Recovery Algorithms Unfolded as Neural Networks. (arXiv:2112.04364v2 [cs.LG] UPDATED)
    (2 min) Motivated by the learned iterative soft thresholding algorithm (LISTA), we introduce a general class of neural networks suitable for sparse reconstruction from few linear measurements. By allowing a wide range of degrees of weight-sharing between the layers, we enable a unified analysis for very different neural network types, ranging from recurrent ones to networks more similar to standard feedforward neural networks. Based on training samples, via empirical risk minimization we aim at learning the optimal network parameters and thereby the optimal network that reconstructs signals from their low-dimensional linear measurements. We derive generalization bounds by analyzing the Rademacher complexity of hypothesis classes consisting of such deep networks, that also take into account the thresholding parameters. We obtain estimates of the sample complexity that essentially depend only linearly on the number of parameters and on the depth. We apply our main result to obtain specific generalization bounds for several practical examples, including different algorithms for (implicit) dictionary learning, and convolutional neural networks.
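    The iteration being unfolded is classical ISTA: a gradient step on the data-fit term followed by soft thresholding. Learned variants such as LISTA replace the fixed matrices and threshold below with trained, possibly layer-specific parameters; the numpy sketch shows the standard algorithm, not the paper's parametrization.

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(A, y, lam, n_iter=500):
    """ISTA for min_x 0.5*||Ax - y||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):                # each iteration = one unfolded "layer"
        x = soft_threshold(x + A.T @ (y - A @ x) / L, lam / L)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 100))
x_true = np.zeros(100)
x_true[[5, 42, 77]] = (1.0, -2.0, 1.5)     # sparse ground truth
x_hat = ista(A, A @ x_true, lam=0.1)       # recovers the sparse support
```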
    Understanding Negative Samples in Instance Discriminative Self-supervised Representation Learning. (arXiv:2102.06866v5 [cs.LG] UPDATED)
    (2 min) Instance discriminative self-supervised representation learning has attracted attention thanks to its unsupervised nature and informative feature representation for downstream tasks. In practice, it commonly uses a larger number of negative samples than the number of supervised classes. However, there is an inconsistency in the existing analysis; theoretically, a large number of negative samples degrade classification performance on a downstream supervised task, while empirically, they improve the performance. We provide a novel framework to analyze this empirical result regarding negative samples using the coupon collector's problem. Our bound can implicitly incorporate the supervised loss of the downstream task in the self-supervised loss by increasing the number of negative samples. We confirm that our proposed analysis holds on real-world benchmark datasets.
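    The coupon-collector connection can be made concrete in two lines: under uniform sampling, the expected number of negative draws needed before every one of $K$ supervised classes appears is $K H_K$, already several times larger than $K$ itself (a toy illustration, not the paper's bound):

    ```python
    def expected_draws_to_cover(num_classes: int) -> float:
        """Coupon collector: expected number of i.i.d. uniform draws needed
        before every one of `num_classes` classes has appeared at least once."""
        return num_classes * sum(1.0 / i for i in range(1, num_classes + 1))

    print(expected_draws_to_cover(10))  # ~29.3, about 3x the number of classes
    ```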
    Consistent Approximations in Composite Optimization. (arXiv:2201.05250v1 [math.OC])
    (2 min) Approximations of optimization problems arise in computational procedures and sensitivity analysis. The resulting effect on solutions can be significant, with even small approximations of components of a problem translating into large errors in the solutions. We specify conditions under which approximations are well behaved in the sense of minimizers, stationary points, and level-sets and this leads to a framework of consistent approximations. The framework is developed for a broad class of composite problems, which are neither convex nor smooth. We demonstrate the framework using examples from stochastic optimization, neural-network based machine learning, distributionally robust optimization, penalty and augmented Lagrangian methods, interior-point methods, homotopy methods, smoothing methods, extended nonlinear programming, difference-of-convex programming, and multi-objective optimization. An enhanced proximal method illustrates the algorithmic possibilities. A quantitative analysis supplements the development by furnishing rates of convergence.
    Machine Learning: The Basics. (arXiv:1805.05052v16 [cs.LG] UPDATED)
    (3 min) Machine learning (ML) has become a commodity in our every-day lives. We routinely ask ML empowered smartphones to suggest lovely food places or to guide us through a strange place. ML methods have also become standard tools in many fields of science and engineering. A plethora of ML applications transform human lives at unprecedented pace and scale. This book portrays ML as the combination of three basic components: data, model and loss. ML methods combine these three components within computationally efficient implementations of the basic scientific principle "trial and error". This principle consists of the continuous adaptation of a hypothesis about a phenomenon that generates data. ML methods use a hypothesis to compute predictions for future events. We believe that thinking about ML as combinations of three components given by data, model, and loss helps to navigate the steadily growing offer for ready-to-use ML methods. Our three-component picture of ML allows a unified treatment of a wide range of concepts and techniques which seem quite unrelated at first sight. The regularization effect of early stopping in iterative methods is due to the shrinking of the effective hypothesis space. Privacy-preserving ML is obtained by particular choices for the features of data points. Explainable ML methods are characterized by particular choices for the hypothesis space. To make good use of ML tools it is instrumental to understand its underlying principles at different levels of detail. On a lower level, this tutorial helps ML engineers to choose suitable methods for the application at hand. The book also offers a higher-level view on the implementation of ML methods which is typically required to manage a team of ML engineers and data scientists.
    Self-Supervised Learning with Data Augmentations Provably Isolates Content from Style. (arXiv:2106.04619v4 [stat.ML] UPDATED)
    (2 min) Self-supervised representation learning has shown remarkable success in a number of domains. A common practice is to perform data augmentation via hand-crafted transformations intended to leave the semantics of the data invariant. We seek to understand the empirical success of this approach from a theoretical perspective. We formulate the augmentation process as a latent variable model by postulating a partition of the latent representation into a content component, which is assumed invariant to augmentation, and a style component, which is allowed to change. Unlike prior work on disentanglement and independent component analysis, we allow for both nontrivial statistical and causal dependencies in the latent space. We study the identifiability of the latent representation based on pairs of views of the observations and prove sufficient conditions that allow us to identify the invariant content partition up to an invertible mapping in both generative and discriminative settings. We find numerical simulations with dependent latent variables are consistent with our theory. Lastly, we introduce Causal3DIdent, a dataset of high-dimensional, visually complex images with rich causal dependencies, which we use to study the effect of data augmentations performed in practice.
    Estimating Gaussian Copulas with Missing Data. (arXiv:2201.05565v1 [stat.ML])
    (2 min) In this work we present a rigorous application of the Expectation Maximization algorithm to determine the marginal distributions and the dependence structure in a Gaussian copula model with missing data. We further show how to circumvent a priori assumptions on the marginals with semiparametric modelling. The joint distribution learned through this algorithm is considerably closer to the underlying distribution than existing methods.
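    For intuition, here is a minimal complete-data sketch of the semiparametric step (rank-based pseudo-observations mapped to normal scores); the paper's actual contribution is the EM treatment of missing entries, which this sketch omits:

    ```python
    import numpy as np
    from scipy import stats

    def gaussian_copula_corr(X):
        """Copula correlation of a complete data matrix X (n x d), estimated by
        mapping each margin to normal scores via its empirical CDF (ranks)."""
        n = X.shape[0]
        U = stats.rankdata(X, axis=0) / (n + 1)  # pseudo-observations in (0, 1)
        Z = stats.norm.ppf(U)                    # normal scores
        return np.corrcoef(Z, rowvar=False)
    ```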
    Combining No-regret and Q-learning. (arXiv:1910.03094v3 [cs.LG] UPDATED)
    (2 min) Counterfactual Regret Minimization (CFR) has found success in settings like poker which have both terminal states and perfect recall. We seek to understand how to relax these requirements. As a first step, we introduce a simple algorithm, local no-regret learning (LONR), which uses a Q-learning-like update rule to allow learning without terminal states or perfect recall. We prove its convergence for the basic case of MDPs (and limited extensions of them) and present empirical results showing that it achieves last-iterate convergence in a number of settings, most notably NoSDE games, a class of Markov games specifically designed to be challenging to learn, where no prior algorithm is known to achieve convergence to a stationary equilibrium even on average.
    Hypergraph reconstruction from network data. (arXiv:2008.04948v4 [cs.SI] UPDATED)
    (2 min) Networks can describe the structure of a wide variety of complex systems by specifying which pairs of entities in the system are connected. While such pairwise representations are flexible, they are not necessarily appropriate when the fundamental interactions involve more than two entities at the same time. Pairwise representations nonetheless remain ubiquitous, because higher-order interactions are often not recorded explicitly in network data. Here, we introduce a Bayesian approach to reconstruct latent higher-order interactions from ordinary pairwise network data. Our method is based on the principle of parsimony and only includes higher-order structures when there is sufficient statistical evidence for them. We demonstrate its applicability to a wide range of datasets, both synthetic and empirical.
    Parallel Neural Local Lossless Compression. (arXiv:2201.05213v1 [eess.IV])
    (2 min) The recently proposed Neural Local Lossless Compression (NeLLoC), which is based on a local autoregressive model, has achieved state-of-the-art (SOTA) out-of-distribution (OOD) generalization performance in the image compression task. In addition to the encouragement of OOD generalization, the local model also allows parallel inference in the decoding stage. In this paper, we propose a parallelization scheme for local autoregressive models. We discuss the practicalities of implementing this scheme, and provide experimental evidence of significant gains in compression runtime compared to the previous, non-parallel implementation.
    Bayesian sense of time in biological and artificial brains. (arXiv:2201.05464v1 [q-bio.NC])
    (2 min) Enquiries concerning the underlying mechanisms and the emergent properties of a biological brain have a long history of theoretical postulates and experimental findings. Today, the scientific community tends to converge to a single interpretation of the brain's cognitive underpinnings -- that it is a Bayesian inference machine. This contemporary view has naturally been a strong driving force in recent developments around computational and cognitive neurosciences. Of particular interest is the brain's ability to process the passage of time -- one of the fundamental dimensions of our experience. How can we explain empirical data on human time perception using the Bayesian brain hypothesis? Can we replicate human estimation biases using Bayesian models? What insights can the agent-based machine learning models provide for the study of this subject? In this chapter, we review some of the recent advancements in the field of time perception and discuss the role of Bayesian processing in the construction of temporal models.
    Machine Learning for Multi-Output Regression: When should a holistic multivariate approach be preferred over separate univariate ones?. (arXiv:2201.05340v1 [stat.ML])
    (2 min) Tree-based ensembles such as the Random Forest are modern classics among statistical learning methods. In particular, they are used for predicting univariate responses. In the case of multiple outputs the question arises whether we should separately fit univariate models or directly follow a multivariate approach. For the latter, several possibilities exist that are based, e.g., on modified splitting or stopping rules for multi-output regression. In this work we compare these methods in extensive simulations to help in answering the primary question of when to use multivariate ensemble techniques.
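    In scikit-learn, the two strategies under comparison can be set up in a few lines (a toy sketch on synthetic data, not the paper's simulation design):

    ```python
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.multioutput import MultiOutputRegressor

    X, Y = make_regression(n_samples=500, n_features=10, n_targets=3, random_state=0)

    # Holistic: one forest whose splits are evaluated on all outputs jointly
    joint = RandomForestRegressor(random_state=0).fit(X, Y)

    # Separate: an independent forest fitted per output
    separate = MultiOutputRegressor(RandomForestRegressor(random_state=0)).fit(X, Y)

    print(joint.score(X, Y), separate.score(X, Y))  # in-sample R^2, toy comparison
    ```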
    A novel notion of barycenter for probability distributions based on optimal weak mass transport. (arXiv:2102.13380v3 [stat.ML] UPDATED)
    (2 min) We introduce weak barycenters of a family of probability distributions, based on the recently developed notion of optimal weak transport of mass by Gozlan et al. (2017) and Backhoff-Veraguas et al. (2020). We provide a theoretical analysis of this object and discuss its interpretation in the light of convex ordering between probability measures. In particular, we show that, rather than averaging the input distributions in a geometric way (as the Wasserstein barycenter based on classic optimal transport does), weak barycenters extract common geometric information shared by all the input distributions, encoded as a latent random variable that underlies all of them. We also provide an iterative algorithm to compute a weak barycenter for a finite family of input distributions, and a stochastic algorithm that computes them for arbitrary populations of laws. The latter approach is particularly well suited for the streaming setting, i.e., when distributions are observed sequentially. The notion of weak barycenter and our approaches to compute it are illustrated on synthetic examples, validated on 2D real-world data and compared to standard Wasserstein barycenters.
    Path Regularization: A Convexity and Sparsity Inducing Regularization for Parallel ReLU Networks. (arXiv:2110.09548v3 [cs.LG] UPDATED)
    (2 min) Understanding the fundamental principles behind the success of deep neural networks is one of the most important open questions in the current literature. To this end, we study the training problem of deep neural networks and introduce an analytic approach to unveil hidden convexity in the optimization landscape. We consider a deep parallel ReLU network architecture, which also includes standard deep networks and ResNets as its special cases. We then show that pathwise regularized training problems can be represented as an exact convex optimization problem. We further prove that the equivalent convex problem is regularized via a group sparsity inducing norm. Thus, a path regularized parallel ReLU network can be viewed as a parsimonious convex model in high dimensions. More importantly, we show that the computational complexity required to globally optimize the equivalent convex problem is fully polynomial-time in feature dimension and number of samples. Therefore, we prove polynomial-time trainability of path regularized ReLU networks with global optimality guarantees. We also provide several numerical experiments corroborating our theory.
    Use of High Dimensional Modeling for automatic variables selection: the best path algorithm. (arXiv:2105.03173v2 [stat.ML] UPDATED)
    (2 min) This paper presents a new algorithm for automatic variable selection. In particular, using properties of graphical models, it is possible to develop a method suited to the context of large datasets. The advantage of this algorithm is that it can be combined with different forecasting models. In this research we used the OLS method and compared the results with the LASSO method.
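    For reference, the two baselines mentioned, LASSO-based selection followed by an OLS refit, look roughly like this in scikit-learn (synthetic data; this is not the paper's best-path algorithm itself):

    ```python
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LassoCV, LinearRegression

    X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                           noise=1.0, random_state=0)

    lasso = LassoCV(cv=5).fit(X, y)                 # LASSO for variable selection
    support = np.flatnonzero(lasso.coef_)           # indices of selected variables
    ols = LinearRegression().fit(X[:, support], y)  # OLS refit on the support
    ```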
    KrigHedge: Gaussian Process Surrogates for Delta Hedging. (arXiv:2010.08407v4 [q-fin.CP] UPDATED)
    (2 min) We investigate a machine learning approach to option Greeks approximation based on Gaussian process (GP) surrogates. The method takes in noisily observed option prices, fits a nonparametric input-output map and then analytically differentiates the latter to obtain the various price sensitivities. Our motivation is to compute Greeks in cases where direct computation is expensive, such as in local volatility models, or can only ever be done approximately. We provide a detailed analysis of numerous aspects of GP surrogates, including choice of kernel family, simulation design, choice of trend function and impact of noise. We further discuss the application to Delta hedging, including a new lemma that relates the quality of the Delta approximation to the discrete-time hedging loss. Results are illustrated with two extensive case studies that consider estimation of Delta, Theta and Gamma and benchmark approximation quality and uncertainty quantification using a variety of statistical metrics. Among our key take-aways are the recommendation to use Matern kernels, the benefit of including virtual training points to capture boundary conditions, and the significant loss of fidelity when training on stock-path-based datasets.
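    A minimal sketch of the basic idea: fit a GP surrogate (with a Matern kernel, per the paper's recommendation) to noisy prices and differentiate the fitted map to get Delta. All data here are toy assumptions, and we use finite differences whereas the paper differentiates the GP posterior analytically:

    ```python
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    # toy data: noisy call-option prices observed on a grid of spot prices
    S = np.linspace(80.0, 120.0, 40).reshape(-1, 1)
    P = np.maximum(S - 100.0, 0.0).ravel() + 0.5 * np.random.default_rng(0).standard_normal(40)

    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=0.25).fit(S, P)

    # Delta as the derivative of the fitted price map (central finite differences)
    h = 1e-3
    delta = (gp.predict(S + h) - gp.predict(S - h)) / (2.0 * h)
    ```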
    Recombinator-k-means: An evolutionary algorithm that exploits k-means++ for recombination. (arXiv:1905.00531v5 [cs.LG] UPDATED)
    (2 min) We introduce an evolutionary algorithm called recombinator-$k$-means for optimizing the highly non-convex $k$-means problem. Its defining feature is that its crossover step involves all the members of the current generation, stochastically recombining them with a repurposed variant of the $k$-means++ seeding algorithm. The recombination also uses a reweighting mechanism that realizes a progressively sharper stochastic selection policy and ensures that the population eventually coalesces into a single solution. We compare this scheme with a state-of-the-art alternative, a more standard genetic algorithm with deterministic pairwise-nearest-neighbor crossover and an elitist selection policy, of which we also provide an augmented and efficient implementation. Extensive tests on large and challenging datasets (both synthetic and real-world) show that for fixed population sizes recombinator-$k$-means is generally superior in terms of the optimization objective, at the cost of a more expensive crossover step. When adjusting the population sizes of the two algorithms to match their running times, we find that for short times the (augmented) pairwise-nearest-neighbor method is always superior, while at longer times recombinator-$k$-means will match it and, on the most difficult examples, take over. We conclude that the reweighted whole-population recombination is more costly, but generally better at escaping local minima. Moreover, it is algorithmically simpler and more general (it could be applied even to $k$-medians or $k$-medoids, for example). Our implementations are publicly available at \href{https://github.com/carlobaldassi/RecombinatorKMeans.jl}{https://github.com/carlobaldassi/RecombinatorKMeans.jl}.
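    The seeding rule repurposed in the crossover step is ordinary $k$-means++ seeding, sketched below; in recombinator-$k$-means the candidate pool is the current population of solutions rather than the raw data, a detail this sketch does not show:

    ```python
    import numpy as np

    def kmeans_pp_seed(X, k, seed=0):
        """Standard k-means++ seeding: each new center is drawn with probability
        proportional to the squared distance to the nearest already-chosen center."""
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centers)
    ```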
    Eikonal depth: an optimal control approach to statistical depths. (arXiv:2201.05274v1 [math.ST])
    (2 min) Statistical depths provide a fundamental generalization of quantiles and medians to data in higher dimensions. This paper proposes a new type of globally defined statistical depth, based upon control theory and eikonal equations, which measures the smallest amount of probability density that has to be passed through in a path to points outside the support of the distribution: for example, spatial infinity. This depth is easy to interpret and compute, expressively captures multi-modal behavior, and extends naturally to data that is non-Euclidean. We prove various properties of this depth, and provide discussion of computational considerations. In particular, we demonstrate that this notion of depth is robust under an approximate isometrically constrained adversarial model, a property which is not enjoyed by the Tukey depth. Finally we give some illustrative examples in the context of two-dimensional mixture models and MNIST.
    A Deep Generative Model for Molecule Optimization via One Fragment Modification. (arXiv:2012.04231v5 [cs.LG] UPDATED)
    (2 min) Molecule optimization is a critical step in drug development to improve desired properties of drug candidates through chemical modification. We developed a novel deep generative model, Modof, over molecular graphs for molecule optimization. Modof modifies a given molecule through the prediction of a single site of disconnection at the molecule and the removal and/or addition of fragments at that site. A pipeline of multiple, identical Modof models is implemented into Modof-pipe to modify an input molecule at multiple disconnection sites. Here we show that Modof-pipe is able to retain major molecular scaffolds, allow control over intermediate optimization steps and better constrain molecule similarities. Modof-pipe outperforms the state-of-the-art methods on benchmark datasets: without molecular similarity constraints, Modof-pipe achieves 81.2% improvement in octanol-water partition coefficient penalized by synthetic accessibility and ring size; and 51.2%, 25.6% and 9.2% improvement if the optimized molecules are at least 0.2, 0.4 and 0.6 similar to those before optimization, respectively. Modof-pipe is further enhanced into Modof-pipem to allow modifying one molecule to multiple optimized ones. Modof-pipem achieves an additional performance improvement of at least 17.8% over Modof-pipe.
    Leveraging Intrinsic Gradient Information for Further Training of Differentiable Machine Learning Models. (arXiv:2112.00094v2 [cs.LG] UPDATED)
    (2 min) Designing models that produce accurate predictions is the fundamental objective of machine learning (ML). This work presents methods demonstrating that when the derivatives of target variables (outputs) with respect to inputs can be extracted from processes of interest, e.g., neural network (NN) based surrogate models, they can be leveraged to further improve the accuracy of differentiable ML models. This paper generalises the idea and provides practical methodologies that can be used to leverage gradient information (GI) across a variety of applications, including: (1) improving the performance of generative adversarial networks (GANs); (2) efficiently tuning NN model complexity; (3) regularising linear regressions. Numerical results show that GI can effectively enhance ML models with existing datasets, demonstrating its value for a variety of applications.
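    One way such gradient information can enter training is a Sobolev-style loss that penalizes mismatch between the model's input gradients and the extracted derivatives; the PyTorch sketch below illustrates the general idea under our own assumptions, not the paper's exact methodology:

    ```python
    import torch

    def gradient_informed_loss(model, x, y, dy_dx, lam=0.1):
        """Sobolev-style loss: fit the targets and, where available, penalize
        mismatch between the model's input gradients and extracted derivatives."""
        x = x.requires_grad_(True)
        pred = model(x)
        grad = torch.autograd.grad(pred.sum(), x, create_graph=True)[0]
        return torch.mean((pred - y) ** 2) + lam * torch.mean((grad - dy_dx) ** 2)
    ```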
    E(n) Equivariant Normalizing Flows. (arXiv:2105.09016v4 [cs.LG] UPDATED)
    (2 min) This paper introduces a generative model equivariant to Euclidean symmetries: E(n) Equivariant Normalizing Flows (E-NFs). To construct E-NFs, we take the discriminative E(n) graph neural networks and integrate them as a differential equation to obtain an invertible equivariant function: a continuous-time normalizing flow. We demonstrate that E-NFs considerably outperform baselines and existing methods from the literature on particle systems such as DW4 and LJ13, and on molecules from QM9 in terms of log-likelihood. To the best of our knowledge, this is the first flow that jointly generates molecule features and positions in 3D.
    $\ell_1$-norm constrained multi-block sparse canonical correlation analysis via proximal gradient descent. (arXiv:2201.05289v1 [stat.ME])
    (2 min) Multi-block CCA constructs linear relationships explaining coherent variations across multiple blocks of data. We view the multi-block CCA problem as finding leading generalized eigenvectors and propose to solve it via a proximal gradient descent algorithm with $\ell_1$ constraint for high dimensional data. In particular, we use a decaying sequence of constraints over proximal iterations, and show that the resulting estimate is rate-optimal under suitable assumptions. Although several previous works have demonstrated such optimality for the $\ell_0$ constrained problem using iterative approaches, the same level of theoretical understanding for the $\ell_1$ constrained formulation is still lacking. We also describe an easy-to-implement deflation procedure to estimate multiple eigenvectors sequentially. We compare our proposals to several existing methods whose implementations are available on R CRAN, and the proposed methods show competitive performances in both simulations and a real data example.
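    For readers unfamiliar with the algorithmic template, here is a generic proximal gradient sketch for the penalized $\ell_1$ problem (the paper instead enforces an $\ell_1$ constraint with a decaying bound over iterations and works on the generalized eigenvector objective):

    ```python
    import numpy as np

    def soft_threshold(x, t):
        return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

    def prox_grad_l1(grad_f, x0, step, lam, iters=200):
        """Proximal gradient descent for min_x f(x) + lam * ||x||_1: a gradient
        step on the smooth part followed by the l1 proximal (soft-threshold) map."""
        x = x0.copy()
        for _ in range(iters):
            x = soft_threshold(x - step * grad_f(x), step * lam)
        return x
    ```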
    Importance sampling based active learning for parametric seismic fragility curve estimation. (arXiv:2109.04323v2 [stat.ML] UPDATED)
    (2 min) The key elements of seismic probabilistic risk assessment studies are the fragility curves which express the probabilities of failure of structures conditional to a seismic intensity measure. A multitude of procedures is currently available to estimate these curves. For modeling-based approaches which may involve complex and expensive numerical models, the main challenge is to optimize the calls to the numerical codes to reduce the estimation costs. Adaptive techniques can be used for this purpose, but in doing so, taking into account the uncertainties of the estimates (via confidence intervals or ellipsoids related to the size of the samples used) is an arduous task because the samples are no longer independent and possibly not identically distributed. The main contribution of this work is to deal with this question in a mathematical and rigorous way. To this end, we propose and implement an active learning methodology based on adaptive importance sampling for parametric estimations of fragility curves. We prove some theoretical properties (consistency and asymptotic normality) for the estimator of interest. Moreover, we give a convergence criterion in order to use asymptotic confidence ellipsoids. Finally, the performances of the methodology are evaluated on analytical and industrial test cases of increasing complexity.
    A Kernel-Expanded Stochastic Neural Network. (arXiv:2201.05319v1 [stat.ML])
    (2 min) The deep neural network suffers from many fundamental issues in machine learning. For example, it often gets trapped into a local minimum in training, and its prediction uncertainty is hard to assess. To address these issues, we propose the so-called kernel-expanded stochastic neural network (K-StoNet) model, which incorporates support vector regression (SVR) as the first hidden layer and reformulates the neural network as a latent variable model. The former maps the input vector into an infinite dimensional feature space via a radial basis function (RBF) kernel, ensuring absence of local minima on its training loss surface. The latter breaks the high-dimensional nonconvex neural network training problem into a series of low-dimensional convex optimization problems, and enables its prediction uncertainty to be easily assessed. The K-StoNet can be easily trained using the imputation-regularized optimization (IRO) algorithm. Compared to traditional deep neural networks, K-StoNet possesses a theoretical guarantee to asymptotically converge to the global optimum and enables the prediction uncertainty to be easily assessed. The performances of the new model in training, prediction and uncertainty quantification are illustrated by simulated and real data examples.
    Data Fusion with Latent Map Gaussian Processes. (arXiv:2112.02206v2 [stat.ML] UPDATED)
    (2 min) Multi-fidelity modeling and calibration are data fusion tasks that ubiquitously arise in engineering design. In this paper, we introduce a novel approach based on latent-map Gaussian processes (LMGPs) that enables efficient and accurate data fusion. In our approach, we convert data fusion into a latent space learning problem where the relations among different data sources are automatically learned. This conversion endows our approach with attractive advantages such as increased accuracy, reduced costs, flexibility to jointly fuse any number of data sources, and ability to visualize correlations between data sources. This visualization allows the user to detect model form errors or determine the optimum strategy for high-fidelity emulation by fitting LMGP only to the subset of the data sources that are well-correlated. We also develop a new kernel function that enables LMGPs to not only build a probabilistic multi-fidelity surrogate but also estimate calibration parameters with high accuracy and consistency. The implementation and use of our approach are considerably simpler and less prone to numerical issues compared to existing technologies. We demonstrate the benefits of LMGP-based data fusion by comparing its performance against competing methods on a wide range of examples.
    Deep Learning for Agile Effort Estimation Have We Solved the Problem Yet?. (arXiv:2201.05401v1 [cs.SE])
    (2 min) In the last decade, several studies have proposed the use of automated techniques to estimate the effort of agile software development. In this paper we perform a close replication and extension of a seminal work proposing the use of Deep Learning for agile effort estimation (namely Deep-SE), which has since set the state of the art. Specifically, we replicate three of the original research questions aiming at investigating the effectiveness of Deep-SE for both within-project and cross-project effort estimation. We benchmark Deep-SE against three baseline techniques (i.e., Random, Mean and Median effort prediction) and a previously proposed method to estimate agile software project development effort (dubbed TF/IDF-SE), as done in the original study. To this end, we use both the data from the original study and a new larger dataset of 31,960 issues, which we mined from 29 open-source projects. Using more data allows us to strengthen our confidence in the results and further mitigate the threat to the external validity of the study. We also extend the original study by investigating two additional research questions. One evaluates the accuracy of Deep-SE when the training set is augmented with issues from all other projects available in the repository at the time of estimation, and the other examines whether an expensive pre-training step used by the original Deep-SE has any beneficial effect on its accuracy and convergence speed. The results of our replication show that Deep-SE outperforms the Median baseline estimator and TF/IDF-SE in only very few cases with statistical significance (8/42 and 9/32 cases, respectively), thus calling into question previous findings on the efficacy of Deep-SE. The two additional RQs revealed that neither augmenting the training set nor pre-training Deep-SE plays a role in improving its accuracy and convergence speed. ...
    A causal view on compositional data. (arXiv:2106.11234v2 [cs.LG] UPDATED)
    (2 min) Many scientific datasets are compositional in nature. Important examples include species abundances in ecology, rock compositions in geology, topic compositions in large-scale text corpora, and sequencing count data in molecular biology. Here, we provide a causal view on compositional data in an instrumental variable setting where the composition acts as the cause. First, we crisply articulate potential pitfalls for practitioners regarding the interpretation of compositional causes from the viewpoint of interventions and warn against attributing causal meaning to common summary statistics such as diversity indices. We then advocate for and develop multivariate methods using statistical data transformations and regression techniques that take the special structure of the compositional sample space into account. In a comparative analysis on synthetic and real data we show the advantages and limitations of our proposal. We posit that our framework provides a useful starting point and guidance for valid and informative cause-effect estimation in the context of compositional data.
  • Google AI Blog

    Introducing StylEx: A New Approach for Visual Explanation of Classifiers
    (8 min) Posted by Oran Lang and Inbar Mosseri, Software Engineers, Google Research. Neural networks can perform certain tasks remarkably well, but understanding how they reach their decisions — e.g., identifying which signals in an image cause a model to determine it to be of one class and not another — is often a mystery. Explaining a neural model’s decision process may have high social impact in certain areas, such as analysis of medical images and autonomous driving, where human oversight is critical. These insights can also be helpful in guiding health care providers, revealing model biases, providing support for downstream decision makers, and even aiding scientific discovery. Previous approaches for visual explanations of classifiers, such as attention maps (e.g., Grad-CAM), highlight whic…
  • The Official NVIDIA Blog

    Billions Served: NVIDIA Merlin Helps Fuel Clicks for Online Giants
    (4 min) Online commerce has rocketed to trillions of dollars worldwide in the past decade, serving billions of consumers. Behind the scenes of this explosive growth in online sales is personalization driven by recommender engines. Recommenders make shopping deeply personalized. While searching for products on e-commerce sites, they find you. Or suggestions can just appear. This wildly…
  • MIT News - Artificial intelligence

    How well do explanation methods for machine-learning models work?
    (7 min) Researchers develop a way to test whether popular methods for understanding machine-learning models are working correctly.
  • AWS Machine Learning Blog

    Detect NLP data drift using custom Amazon SageMaker Model Monitor
    (11 min) Natural language understanding is applied in a wide range of use cases, from chatbots and virtual assistants, to machine translation and text summarization. To ensure that these applications are running at an expected level of performance, it’s important that data in the training and production environments is from the same distribution. When the data that […]
    Computer vision-based anomaly detection using Amazon Lookout for Vision and AWS Panorama
    (9 min) This is the second post in the two-part series on how Tyson Foods Inc. is using computer vision applications at the edge to automate industrial processes inside their meat processing plants. In Part 1, we discussed an inventory counting application at packaging lines built with Amazon SageMaker and AWS Panorama. In this post, we […]
  • Becoming Human: Artificial Intelligence Magazine - Medium

    How Unidentified Testing Can Help Find Bugs That Aren’t Immediately Visible?
    (3 min) Even when you believe you have everything perfect, non-functional testing might discover problems that were previously undetected.
    How the Future of the Recruitment Industry Is Going to Shape Out
    (3 min) The Recruitment Industry is going to witness a massive revolution in the coming decade.
    5 algorithms one must know to start with Quantum Computing.
    (5 min) “Life is strong and fragile. It’s a paradox… It’s both things, like quantum physics: It’s a particle and a wave at the same time. It all…
  • Artificial Intelligence

    Researchers Introduce High-Performance Deep Learning Toolbox for Genome-Scale Prediction of Protein Structure and Function
    (1 min) Gene functional annotation refers to correctly inferring a gene’s function from its sequence, a critical role in the biological sciences. The exponential expansion in the number of sequenced genomes has been fueled by dramatic breakthroughs in next-generation sequencing technologies, resulting in an increasing bottleneck of incorrect gene annotation. Computational techniques can help eliminate the gene-annotation bottleneck by taking advantage of this vast amount of data. Researchers from the Georgia Institute of Technology and the Department of Energy’s Oak Ridge National Laboratory are using supercomputing and new deep learning technologies to anticipate the structures and functions of thousands of proteins with unknown activities. The team used the Summit supercomputer to accurately identify protein structures and activities across entire genomes of species. Their deep learning-based algorithms predict protein structure and function from DNA sequences, speeding up discoveries that could help enhance biotechnology, biosecurity, bioenergy, and pollution and climate change solutions. Paper: https://ieeexplore.ieee.org/document/9652872 submitted by /u/techsucker
    How to Choose the Grid Size and Block Size for a CUDA Kernel?
    (1 min) submitted by /u/Just0by
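    A common starting point, sketched here with Numba's CUDA bindings (the kernel, array sizes, and the 256-thread block are illustrative assumptions, and running it requires a CUDA-capable GPU): fix the threads per block at a multiple of the 32-thread warp size, derive the grid size by ceiling division over the problem size, and guard against the padded last block inside the kernel.

    ```python
    import numpy as np
    from numba import cuda

    @cuda.jit
    def add(a, b, out):
        i = cuda.grid(1)            # global thread index
        if i < out.size:            # guard: the last block may be partly padded
            out[i] = a[i] + b[i]

    n = 1_000_000
    a = np.ones(n, dtype=np.float32)
    b = np.ones(n, dtype=np.float32)
    out = np.zeros_like(a)

    threads_per_block = 256                                             # multiple of warp size (32)
    blocks_per_grid = (n + threads_per_block - 1) // threads_per_block  # ceil(n / tpb)
    add[blocks_per_grid, threads_per_block](a, b, out)
    ```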
    Face Recognition During Coronavirus: Findings From the Experiments
    (0 min) With the arrival of the COVID-19 pandemic, the industry has started to question the effectiveness of face recognition on masked faces. We left the assumptions behind and conducted several experiments using CompreFace. https://exadel.com/news/face-recognition-during-coronavirus-do-masks-block-facial-recognition submitted by /u/lklimusheuskaja
    Nice Explanation: Artificial Intelligence (AI) explained in 2 minutes
    (0 min) submitted by /u/Philo167
    How is AI Simplifying Software Development for Companies And Coders?
    (0 min) submitted by /u/noahthearc333
    My Sona Original Character Art
    (0 min) submitted by /u/VIRUS-AOTOXIN
    Weekly China AI: Top 10 Tech Trends for 2022; How to Make Composite Image Look Authentic? Add Shadows; New Chinese Benchmark Tests NLU & NLG Models
    (1 min) submitted by /u/trcytony
    Who was the first AI villain conceived?
    (3 min) For some reason I'm obsessed with AI villains. The train of thought the writers have to ride to come up with them seems to put them on a flow that creates revolutionary pieces of art, be it Ghost in the Shell (which inspired The Matrix!), System Shock, or 2001: A Space Odyssey, to name a few. I'm curious: who was the first AI villain conceived? I tried googling it, but to no avail. submitted by /u/Zunge
    Building an Expert System
    (1 min) Seeking views on whether the GPT family is sufficiently mature to be used as the core components of an expert system (knowledge base plus inference engine), or whether it would be easier to go with a conventional knowledge graph and rule-based inference engine. Thanks! submitted by /u/minisoo
    Artificial Nightmares: Chesire Cat Fields || Pytti VQGAN AI Art Video [4K 60 FPS]
    (1 min) submitted by /u/Thenamessd
  • Machine Learning

    [N] Competition on Meta-learning from Learning Curves (Prizes: 1,000$) - 3 weeks left
    (1 min) https://metalearning.chalearn.org/ Link to our competition: https://codalab.lisn.upsaclay.fr/competitions/753 Prizes: 1,000$ Introduction video: https://youtu.be/1rLc9Z0vRzw Medium article: https://medium.com/@hungnm.vnu/meta-learning-from-learning-curves-ieee-wcci-2022-competition-5e1932742644 Artificial learning systems are good at learning to self-drive cars, to assist doctors in diagnosing disease, to play video games, to recognize faces, and to translate languages, but they do poorly across many tasks. Recently, more effort has been put into recycling the knowledge acquired by learning one task (e.g. recognizing objects) to learn a new task (e.g. recognizing faces) faster and better, typically by re-using or fine-tuning learned representations. BUT IS IT POSSIBLE TO LEARN-TO-LEARN? Namely develo…
    [D] Google is killing the Data Scientist
    (1 min) So I've noticed that AutoML is reducing the need for Data Scientists that build models. Shoving data into models can be done by anyone who knows SQL (data analysts, data engineers), and AutoML is MASSIVELY out-performing teams of data scientists. Do you think data science is overhyped right now? Do you think there's a need to build and train models when AutoML can do it better and cheaper? See: https://ilovedata.medium.com/how-google-is-killing-the-data-scientist-94e641c98e2e submitted by /u/Flat_Shower
    [D] Do the advances in machine learning mean the dawn of theory-free science?
    (2 min) An interesting paper by Laura Spinney in the Guardian about the advent of post-theory or theory-free science, based only on empirical data and machine learning, particularly deep learning. She revisits Chris Anderson's famous 2008 paper "The End of Theory", published in Wired. submitted by /u/ClaudeCoulombe
    [D] Mathematics research towards machine learning
    (1 min) Hello everyone. I'm a soon-to-graduate undergrad student, and I have recently been thinking about pursuing a master's degree. I'm really interested in formal and advanced mathematics, particularly in probability theory, optimization, and stochastic calculus. After my master's I still plan on working with ML for a while, and I wanted to know if these research areas are somewhat useful in machine learning and if I could have a research topic that I'm really interested in. submitted by /u/Bira-of-louders
    [P] FFCV: Accelerated Model Training via Fast Data Loading
    (2 min) Hi r/MachineLearning! Today we released FFCV (ffcv.io), a library that speeds up machine learning model training by accelerating the data loading and processing pipeline. With FFCV we were able to: Train an ImageNet model (ResNet-18) to 66.9% accuracy *on one GPU* in 35 minutes (98¢/model on AWS) Train an ImageNet model (ResNet-50) to 75.6% accuracy *on one AWS machine* in 20 minutes ($4.5/model on AWS) Train a CIFAR-10 model on one GPU in 36 seconds (2¢/model on AWS) The best part about FFCV is how easy it is to use: chances are you can make your training code significantly faster by converting your dataset to FFCV format and changing just a few lines of code! To illustrate just how easy it is, we also made a minimal ImageNet example that gets high accuracies at SOTA speeds: https://github.com/libffcv/ffcv-imagenet. Let us know what you think! submitted by /u/andrew_ilyas
    [P] Speech Separation Algorithm
    (1 min) I'm working on using an audio wav file as input that would have 2 to 3 people conversing. I have to identify the different voices, separate the two voices, convert them into text, and store the transcripts. I have gone through Hugging Face transformers such as wav2vec2, and svoice from Facebook Research, but I find it difficult to implement the models. Can someone guide me on approaching such tasks, as I am a beginner in the audio domain of deep learning? submitted by /u/Yagna24
    [P] How do we account for ‘confounding due to indication’ while training a model?
    (2 min) Hello community, I am a newbie to machine learning and recently got involved with building models in health outcomes research. We have a risk prediction and classification model that we trained initially using logistic regression. We found that a very important variable is significantly positively associated with the outcome, but in reality it should be negatively associated. We then figured out the reason: it was confounded by indication. We eliminated the variable and then employed lasso and built generalized boosted and random forest models. My concern is, given that this is one of the most important variables for our model from a usability point of view, I am not very certain about dropping it. Is there a way we can account for this confounding effect and include the variable in other algorithms? We are working with retrospective big data. Any suggestions or direction to any literature is helpful. Thank you! submitted by /u/HealthtoML
    [R] Instance-segmentation combined with regression per instance
    (1 min) Hi there. For a project I'm looking into adding a regression head to a Mask R-CNN model. The aim is to predict a continuous variable for each of the detected instances. An example could be to predict the distance to the camera for each of the detected pedestrians, or their age, weight or height. I wonder whether there is any research into this, and what the problem statement or related field of research is called. I've already done a Google Scholar search, and read through all of the paperswithcode methods, but could not find anything remotely like this. Most of the results relate to regression of the bounding boxes, not to a regression output variable. Keywords to search for, suggested readings, research groups, or anything that touches on this subject is much appreciated. submitted by /u/pietermarsman
    "None of above" problem (Image Classification) [D]
    (1 min) Hello, I would like to know how you handle the "None of the above" problem in multi-class image classification ("None of the above" problem: the image does not match any of the classes). Thank you. submitted by /u/CommercialWest7683
    [P] Choosing a self-supervised learning framework that's easy to use
    (1 min) I need to pretrain a spatial image classification model using self-supervised learning and would like to find the best/easiest-to-use implementation of some SOTA method. There are a couple of factors: TensorFlow is preferred, but if there are great implementations in PyTorch, that is also fine; training on multiple GPUs needs to be supported. Also, these likely don't impact the framework selection, but are worth noting: ResNet will be used as the (initial) backbone, and the dataset is huge; 1B images can be used if necessary (if dataset size is a factor when choosing the implementation). Do you have any recommendations/links to code that can be used? The frameworks I'm initially considering are having issues: SimCLR can be trained only on one GPU (any known workarounds for that?); BYOL, again, seems not optimized for running on multiple GPUs. submitted by /u/alkibijad
    [D] Visualizing attention
    (0 min) How do you visualize attention, especially when there are many tokens (256, 512) with multiple layers and multiple heads? Most visualizations and frameworks I’ve tried fail when there are more than 100 tokens. This is quite important since most NLP and vision models now use multi-head attention. submitted by /u/mlvpj
    [N] We just got funded for an open-source project to make Metric Learning practical.
    (1 min) Hello all 👋 We are developing Qdrant, an open-source neural search engine written in Rust https://github.com/qdrant/qdrant. We are happy to announce that we have just closed our initial funding round to build our technology further and make Metric Learning practical. 🚀 Reddit and the community played an important role. Thank you! 🙏 Our vector similarity engine provides a production-ready service with a convenient API to store, search, and manage vectors (embeddings) along with the additional payload. Qdrant is tailored to extended filtering support, making it useful for all sorts of neural-network or semantic-based matching, recommendations, faceted search, and other applications. Feedback welcome! submitted by /u/devzaya
    [D] I am working on streamlining computer vision data annotation and Nvidia's Transfer Learning Toolkit (TLT)
    (1 min) I am working on a project to integrate a data annotation and management tool with TLT and make a UI for it. Currently I am in the brainstorming phase. I would like to know whether there are any such solutions out there which I can refer to as a benchmark. If not, are there any thoughts on how to develop such a tool? Any inputs are appreciated. submitted by /u/Ahmad401
    [P] I trained every single SOTA from 2021 and accidentally got a silver medal on Kaggle
    (3 min) [leaderboard screenshot: https://i.ibb.co/gwpJXBm/lb.png] I trained every single SOTA model from 2021 and accidentally got a silver medal on an image classification competition on Kaggle recently (Pawpularity Contest), here, if you are interested. The idea was to train every SOTA and then nuke the leaderboard with a 10-billion-parameter ensemble of ensembles. Some ensembles were also supplemented a bit with a CatBoost second-stage model, just for the "why not". Outline of the approach: https://i.ibb.co/McJ39mW/image-nuke.png This stunt was done mainly for the purpose of me catching up with the most recent SOTA vision papers. I seriously didn't try to compete on the leaderboard and never had the intention of releasing a public notebook that actually gets a silver medal. This came as a complete surprise to me! Hope the solution will be useful for many others in the future. If you have any questions or feedback, I'll be more than happy to discuss them! submitted by /u/yamqwe
    Any ML engineers here that have degrees from unrelated fields? [D]
    (5 min) Been going back and forth on mastering out from my PhD program in mechanical engineering. Research mostly involves ML though. I want to be an ML engineer or ML research scientist (not sure yet), and I don’t know whether it’d be better to continue my PhD for another 2-3 years or whether to master out after this summer and then start working in an ML related role. Although, I’d probably have to take some months first to self-study some more / do more personal projects to add to my portfolio. So I’m wondering if anyone in this sub got a PhD in an unrelated field & managed to land an ML role that pays well. Would love to hear your story. Thanks! submitted by /u/Careless_Drummer_208
    [Research] Regression problems where the input shape may change.
    (2 min) Specifically, I'm working on an applied ML problem where I need to create a surrogate model for optimization of just one feature. Imagine my input data is a 2D physical plane with a vector field superimposed. If the vector field is highly divergent, then I get a big output number, Omega. I care more that my predicted Omega has a high R-squared value than about absolute error. Now, I've been able to get a CNN to work for a simple 2D grid with reasonable results. But I need my model to be able to handle the case where the input shape isn't always a simple 2D plane. The model needs to be able to handle holes in the physical layer the vector field is superimposed on: sometimes there will be holes and sometimes there won't, and the holes can move. I've briefly looked into Spatial Transformer Networks, but they're mostly centered on computer vision. So much research in this area is for computer vision, and I am playing research-paper hopscotch trying to find something to help me. Does anyone have any recommendations for current research? Papers or keywords to search for would help. Thank you in advance. Seriously though, I wish there was an ML community that focused on everything in ML but computer vision. submitted by /u/Slowbutstrong
  • Neural Networks, Deep Learning and Machine Learning

    Feedforward delay of deep neural networks
    (1 min) I'm wondering if exciting a DNN (i.e., feed-forwarding the values of each layer) would ever be a problem on a hardware-limited platform like a small drone. For a time-sensitive application like using a DNN as a controller, how large would the network have to be for the feed-forward process to impose considerable issues in terms of time delay? Have there been any studies exploring this concept? submitted by /u/Cogitarius
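    One empirical way to approach the question is simply to time forward passes of candidate network sizes on the target hardware; the PyTorch sketch below (sizes and iteration counts are arbitrary assumptions) measures mean single-input latency:

    ```python
    import time
    import torch

    net = torch.nn.Sequential(
        torch.nn.Linear(64, 256), torch.nn.ReLU(),
        torch.nn.Linear(256, 8),
    )
    x = torch.randn(1, 64)

    with torch.no_grad():
        for _ in range(100):                 # warm-up passes
            net(x)
        t0 = time.perf_counter()
        for _ in range(1000):
            net(x)
        mean_latency = (time.perf_counter() - t0) / 1000

    print(f"mean forward latency: {mean_latency * 1e6:.1f} us")
    ```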
    Looking for a Cyclist Detector or a Database containing images of cyclists.
    (1 min) TLDR; Need help finding a Cyclist Detector or a Database containing images of cyclists. Hello! I have a camera set up looking out a second-story window at the street and would like to count the number of cyclists passing by. I'm using the OpenVINO tech stack to do the detection, but their 'person-vehicle-bike-detection-crossroad-1016' has atrocious accuracy for cyclists. (For reference: https://docs.openvino.ai/2021.2/omz_models_intel_person_vehicle_bike_detection_crossroad_1016_description_person_vehicle_bike_detection_crossroad_1016.html ) I'm asking for a little help in case someone has knowledge on the subject. (My google-fu has been unable to find what I need on this occasion.) If anyone would kindly provide information on a cyclist detection algorithm in Caffe, TensorFlow, Kaldi, or MXNet; I could use the OpenVINO model optimizer to turn them into Intermediate Representations and get it working with my setup. Likewise, the longer route would be to train my own model. However, as many of you would know, getting a proper dataset is the hardest step to get started. If anyone knows an appropriate data set I would be happy to take a look to see if it's suitable. I appreciate your time and suggestions on this matter. submitted by /u/Naviance_Tartico
  • Reinforcement Learning

    Is optimal action sequence problem in a finite stochastic MDP undecidable?
    (1 min) We know that an optimal policy can be computed for a finite MDP even if it's stochastic. But what if the question is: given a state s_t, what is the optimal action sequence of length k, i.e., {a_t, ..., a_t+k-1}? We know a_t from the optimal policy, but the best we can know about s_t+1 is a probability distribution, which makes it essentially a belief state in a POMDP. Since planning with belief states in POMDPs is generally undecidable, the optimal action sequence problem in a finite stochastic MDP is also undecidable. Or am I wrong? submitted by /u/ritiange
    What are the best algorithms for team games (3v3, 4v4, 5v5)?
    (1 min) I want to train bots for some team game where each team will have 3 to 5 teammates. The environment is partially observable. What are the best algorithms for that kind of environment? submitted by /u/TheGuy839
    Applications of RL in instructional sequencing?
    (1 min) Hi all! I'm new here (both to reddit and to the RL world) and very happy to join the community! I am interested in developing algorithms for adaptive learning in education (by adaptive learning I mean algorithms that help students define their own learning path through some sort of educational platform), and I'd like to know if any of you have heard about using RL to that end. I've read some sources where they mention the use of MDPs and POMDPs for instructional sequencing (check this one for instance), but I'm not sure if this subarea has developed any further since. The reason why I think RL might be interesting to me is that eventually, I'd like to work on an algorithm that delivers a collaborative instructional sequence for a group of students. That is, given a bunch of students with a common goal (e.g. doing some teamwork), output an optimal sequence of concepts to study and exercises to solve, so that each one of them passes the subject and such that the group benefits from the individual skills as much as possible. If I base my adaptive learning algorithms on RL, then I could extend them with these collaborative features quite naturally using MARL... I guess that my question here is: does any of this make any sense to you? xd P.S: we are talking about a study and development period of 3-4 years submitted by /u/aloaberasturi
    Can anyone explain how this happened? (linear function approximation)
    (0 min) [image: https://preview.redd.it/vlm36ckqbhc81.png?width=940&format=png&auto=webp&s=06ee88e19790f7dc96438e63d034f8b18fe2f045] submitted by /u/Whole_Run_4485
    Do softmax values of policy gradient keep the order of values for each action
    (1 min) Assume I train a softmax policy with policy gradient. At test time, I input one state and I get the softmax values for the n actions z_1, z_2, z_3, ..., z_n with, for example, z_1 > z_2 > ... > z_n. Can I then expect that the Q-values have the same order, i.e. Q(s, a_1) > Q(s, a_2) > ... > Q(s, a_n)? submitted by /u/fedetask
    Advice needed on RL problem definition. (stochastic, time-varying environment)
    (1 min) Hi everyone! Recently, I have been working on a problem that I want to solve with reinforcement learning. However, I have some doubts about what would be the right way of defining the problem in an RL framework. This is what I have so far: The action space is quite simple; the agent has a choice of 4 possible (fixed) actions. The state space and environment are where I get a bit stuck. The environment is stochastic and time-varying. A realization of this environment is available as a state, so what is observed of the environment is given as a state (call this part of the state state_1). Since the environment is stochastic, the state transitions are stochastic. Also, due to the time-variability, the shape of the distribution changes with time. Another factor is added to the state definition as well (state_2). This factor is actually a (deterministic) function of the state and chosen action, so it is directly related to the chosen action in a state. Furthermore, it is related to the previous value of state_2. As a reward, a factor based on the performance of the action given the state is defined. I think this would be hard for the agent to solve, since it only sees realizations of the environment. Another thought I have is to change the definition of state_1 to the environment distribution parameters of the current time. That way, the agent can make decisions based on the estimated distribution, instead of a realization of it. However, in that case I have to find a method to estimate the distribution. Do you guys have any thoughts on how to solve such a problem? And what algorithm to use? Any advice is welcome! Thanks! submitted by /u/plopthegnome
    Help with reward function
    (2 min) Hello :D, I have a problem where an agent needs to learn how to reach a certain target location (the environment is a grid world). The agent has 5 possible actions: to move (up, down, right, left) or to stay idle. My reward function is as follows: moving has a cost of -1, while staying idle has a cost of 0. If the agent gets closer to the target, the reward is incremented by 1, and decremented by 1 otherwise. Basically, if the agent moves towards the target, the reward for that step would be 1-1=0; if the agent moves away, the reward would be -1-1=-2; and if the agent stays idle, the reward would be 0-1=-1 (-1 for not getting closer to the target). This works perfectly at the speed I started with. However, when I reduce the speed, the agent never learns to finish the task (in fact, the agent for some reason decides to stay idle all the time). What could be the reason for this? Decreasing the speed generally means that the agent needs more time steps to finish the task. How could this change cause this problem, when the original speed worked just fine? submitted by /u/AhmedNizam_
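    The described scheme, written out as code (a direct transcription of the post's arithmetic, with names of our own choosing):

    ```python
    def step_reward(dist_before: int, dist_after: int, moved: bool) -> float:
        """Reward scheme from the post: -1 per move, 0 for idling, then +1 if
        the step reduced the distance to the target and -1 otherwise."""
        r = -1.0 if moved else 0.0
        r += 1.0 if dist_after < dist_before else -1.0
        return r

    assert step_reward(5, 4, moved=True) == 0.0    # moved closer:  1 - 1
    assert step_reward(5, 6, moved=True) == -2.0   # moved away:   -1 - 1
    assert step_reward(5, 5, moved=False) == -1.0  # stayed idle:   0 - 1
    ```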
    "The Rise of A.I. Fighter Pilots: Artificial intelligence is being taught to fly warplanes. Can the technology be trusted?"
    (1 min) submitted by /u/gwern [link] [comments]
    Latest CMU Research Improves Reinforcement Learning With Lookahead Policy: Learning Off-Policy with Online Planning
    (1 min) Reinforcement learning (RL) is a technique that allows artificial agents to learn new tasks by interacting with their surroundings. Because of their capacity to reuse previously acquired data and incorporate input from several sources, off-policy approaches have lately seen a lot of success in RL for effectively learning behaviors in applications like robotics. What is the mechanism of off-policy reinforcement learning? A parameterized actor and a value function are generally used in a model-free off-policy reinforcement learning approach (see Figure 2 of the paper). As the actor interacts with the environment, the transitions are recorded in the replay buffer. The value function is trained on transitions from the replay buffer to forecast the actor's cumulative return, and the actor is updated to maximize the action values at the states visited in the replay buffer. Continue Reading Paper: https://arxiv.org/pdf/2008.10066.pdf Project: https://hari-sikchi.github.io/loop/ Github: https://github.com/hari-sikchi/LOOP CMU Blog: https://blog.ml.cmu.edu/2022/01/07/loop/ submitted by /u/techsucker [link] [comments]
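    The loop sketched above fits in a few lines of PyTorch. This is a generic DDPG-style off-policy update for illustration, not the paper's LOOP algorithm; the network sizes and buffer contents are hypothetical:

      import collections, random
      import torch
      import torch.nn as nn

      gamma = 0.99
      buffer = collections.deque(maxlen=100_000)   # replay buffer of (s, a, r, s') tensors,
                                                   # filled as the actor interacts with the env

      actor  = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2), nn.Tanh())
      critic = nn.Sequential(nn.Linear(4 + 2, 64), nn.ReLU(), nn.Linear(64, 1))
      a_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
      c_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

      def train_step(batch_size=128):
          s, a, r, s2 = map(torch.stack, zip(*random.sample(buffer, batch_size)))
          # Value function: regress Q(s, a) toward the one-step bootstrap target.
          with torch.no_grad():
              target = r.unsqueeze(1) + gamma * critic(torch.cat([s2, actor(s2)], dim=1))
          c_loss = ((critic(torch.cat([s, a], dim=1)) - target) ** 2).mean()
          c_opt.zero_grad(); c_loss.backward(); c_opt.step()
          # Actor: ascend on the critic's value of the actor's own actions.
          a_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
          a_opt.zero_grad(); a_loss.backward(); a_opt.step()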

2022-01-17

  • Artificial Intelligence

    I think it might not be a bad idea to think of the corporation itself as a form of AI
    (1 min) While people are certainly deeply involved in the way a corporation functions, it often seems to display a will of its own. There is an inexorable logic to the way businesses are run, and that is ignoring the potential for bad individual actors. Legally, corporations are in many ways first-class citizens, and if they murder people they aren't held meaningfully accountable even when they are caught. I think the AI is programmed in places like business schools, which will teach you how to run a successful business, but will also instill in you a certain worldview. A worldview that has very real consequences, since it is misaligned with reality: it ignores externalities and tries to make us pay the cost of its harms. I get that this may be a little far out for some, but I do not say these things lightly. All of the tech companies are starting to show their character, and predictably they are harming the world. submitted by /u/Memetic1 [link] [comments]
    Last Week in AI - AI early warning system for new Covid-19 variants, increased federal spending on facial recognition, AI used to detect fake art, and more!
    (1 min) submitted by /u/regalalgorithm [link] [comments]
    Text to AI generated artwork
    (1 min) One page I discovered recently will take songs and have AI generate artwork for the lyrics. I love this idea, but am very green in AI. What ways could I do this? I would like to take books and poems and have AI generate artwork for them based on just the words. I would also like to be able to have these maybe on a canvas to have in my house. What would be the best way to go about it? Edit: I see the Wombo Dream app has a tool that does exactly this, but only allows for 100 characters, and I want to make something that is my own. submitted by /u/ImpeccableSloth33 [link] [comments]
    Artificial Nightmares : Ronald McDonald's Last Supper || Pytti AI Art Video [4K 60 FPS]
    (0 min) submitted by /u/Thenamessd [link] [comments]
    Creating Synthetic Time Series Data for Global Financial Institutions – a POC Deep Dive
    (0 min) https://gretel.ai/blog/creating-synthetic-time-series-data-for-global-financial-institutions-a-poc-deep-dive submitted by /u/Repeat-or [link] [comments]
    For those of you AI lovers, I have created an AI tool that estimates a price for your Udemy course based on some keywords. If you are interested in how I made such a tool, I wrote an article about it. Feel free to check it out. I hope it inspires you! ( :
    (1 min) submitted by /u/amirdol7 [link] [comments]
    What is voyage?
    (0 min) submitted by /u/roblox22y [link] [comments]
    [R] Pushing the Limits of Self-Supervised ResNets: DeepMind’s ReLICv2 Beats Strong Supervised Baselines on ImageNet
    (1 min) A DeepMind research team proposes ReLICv2, which demonstrates for the first time that representations learned without labels can consistently outperform a strong, supervised baseline on ImageNet and even achieve comparable results to state-of-the-art self-supervised vision transformers (ViTs). Here is a quick read: Pushing the Limits of Self-Supervised ResNets: DeepMind’s ReLICv2 Beats Strong Supervised Baselines on ImageNet. The paper Pushing the Limits of Self-supervised ResNets: Can We Outperform Supervised Learning Without Labels on ImageNet? is on arXiv. submitted by /u/Yuqing7 [link] [comments]
    Bill Gates & Yuval Noah Harari Discuss: What Happens When AI Knows Humans Better Than They Know Themselves? (short audio clip)
    (0 min) submitted by /u/frog9913 [link] [comments]
    Machine learning is only useful if it is based on reliable data.
    (2 min) For engineers interested in the power of artificial intelligence, the application of machine learning to individual health should be the primary focus. AI has the potential to provide higher-quality treatment to a larger proportion of the world's population at a lower cost. Drugs that are linked to the genetic code, as well as million-dollar imaging and diagnostic devices, are out of reach for many of the world's poorest people. In order to deliver on the promise of technology, scientists must turn to raw materials that are affordable to the vast majority of the world's population. Machine learning is sensitive to the degree of similarity across people; the physician can learn what works for one patient and then tailor suggestions to the unique qualities of another. The revolutionary promise of AI is that health-care systems no longer have to choose between personalization and scalability. What could possibly go wrong? A lot. Bias in medical machine learning may be lethal. The most readily available data for training healthcare models does not reflect the worldwide illness burden. Similarly, volunteers in clinical studies do not fully reflect the range of patients. Bias in machine learning is frequently caused by training an algorithm on data that does not accurately reflect reality: the machine's world is defined by what you show it. When the training data is realigned, the algorithm learns to rectify previous tendencies. AI shows health systems that their patients are not a block but a group of individuals with different needs and risks. It is counter-intuitive, but if done right, it could reinforce the humanity of medicine by emphasising what a provider notices about a patient, making the interaction between patient and provider ever more central. We have to be ready for the changes that AI can bring to medicine. It is necessary now because of the huge overload caused by the crisis. submitted by /u/FrothyDigger [link] [comments]
    TinyML is bringing neural networks to small microcontrollers
    (0 min) submitted by /u/bendee983 [link] [comments]
    Approaching the Edge of the Universe: ‘Handle Nebula’ [ARTBREEDER]
    (0 min) submitted by /u/justafriendlymouse [link] [comments]
  • Machine Learning

    [Discussion] Graph network to embedded vector
    (1 min) Hello, I am looking for methods to turn a graph network of variable size into a fixed-length vector for input into a NN. I have already found methods such as graph2vec but am looking for alternative methods as well due to implementation difficulty. This is assuming the graph is already a list of embedded nodes (from DeepWalk, node2vec, or any other node embedding method). Inductive embedding methods are preferred due to the shortcomings of transductive techniques. If anyone can provide tips or sources it would be greatly appreciated! submitted by /u/Upstairs-Show-5962 [link] [comments]
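    One simple inductive baseline worth trying before heavier machinery (a sketch under the poster's assumption that node embeddings already exist): collapse the node embeddings with permutation-invariant pooling, which yields the same fixed length for any graph size:

      import numpy as np

      def graph_readout(node_embeddings):
          """Turn a (num_nodes, d) array of node embeddings into a fixed
          2*d graph vector via mean and max pooling; invariant to both
          node order and node count, so it applies to unseen graphs."""
          X = np.asarray(node_embeddings)
          return np.concatenate([X.mean(axis=0), X.max(axis=0)])

      g1 = np.random.randn(17, 64)   # a 17-node graph with 64-d node embeddings
      g2 = np.random.randn(5, 64)    # a 5-node graph
      assert graph_readout(g1).shape == graph_readout(g2).shape == (128,)

    This discards structural detail that graph2vec would keep, but it is trivial to implement and often a surprisingly strong baseline.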
    [D] A Science Journalist’s Journey to Understand AI (The Gradient)
    (1 min) Hi there, just want to share the latest Gradient article A Science Journalist’s Journey to Understand AI. Might be an interesting read to get an outsider's perspective on AI, and be reminded of the misconceptions/views non-technical folks have about it. Though some of these things are also quite specific to the author's point of view. In any case, I think it's worth a read. submitted by /u/regalalgorithm [link] [comments]
    [P] UltimateAdam for Tensorflow - transition from SGD to Adam
    (2 min) Hi all. I've been playing around with a new optimizer that I've had some good success with. We know now that one of the main problems with Adam is that the variance of the adaptive learning rates in the early steps of optimization causes a sub-optimal trajectory, which ultimately leads to converging to a sub-optimal final minimum. As it turns out, this is the main reason why many people say that SGD is the gold standard for final performance. There have been studies recently showing that when Adam is tuned correctly and the variance issues are dealt with, Adam can actually achieve even better final performance than SGD in many cases. Rectified Adam is one way that we try to deal with this variance issue by warming up the learning rate in the early steps, so that the trajectory is more stable and better qu…
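    For readers who want the gist of the warmup idea in code (a minimal sketch in TensorFlow/Keras, not the poster's UltimateAdam; the class name and constants are made up):

      import tensorflow as tf

      class LinearWarmup(tf.keras.optimizers.schedules.LearningRateSchedule):
          """Ramp the learning rate from 0 to peak_lr over warmup_steps, then
          hold it; this damps Adam's noisy adaptive estimates early on."""
          def __init__(self, peak_lr=1e-3, warmup_steps=1000):
              self.peak_lr, self.warmup_steps = peak_lr, warmup_steps
          def __call__(self, step):
              step = tf.cast(step, tf.float32)
              return self.peak_lr * tf.minimum(1.0, step / self.warmup_steps)

      optimizer = tf.keras.optimizers.Adam(learning_rate=LinearWarmup())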
    [D] Federated Learning Libraries in 2022?
    (1 min) I'm starting a new experiment on federated learning in Python and was looking to see what libraries were available. It seems like PySyft was the most popular, but there seem to be multiple new options now (IBM, Flower, TensorFlow, FedML). Does anyone have enough experience to know which of these would be the best to use? submitted by /u/JustARoom [link] [comments]
    [P] How we generate high-quality synthetic time-series data for one of the largest financial institutions in the world.
    (1 min) In this project, we discuss the creation of high-quality synthetic time-series datasets, and the methods we designed to assess the accuracy and privacy of our models and data. Code and examples available with the blog. https://gretel.ai/blog/creating-synthetic-time-series-data-for-global-financial-institutions-a-poc-deep-dive submitted by /u/alig80 [link] [comments]
    [D] NLP: Knowledge based vs neural based
    (0 min) I have been working in nlp with neural language models (W2V, transformers, etc...) for a while and wanted to expand my skill set to the knowledge based approaches. Are knowledge based approaches used in industry and how can they complement neural based approaches? Any book/learning material reference is highly appreciated, thanks! submitted by /u/eatpasta_runfastah [link] [comments]
    [D] AISTATS 2022 Paper Acceptance Result
    (0 min) AISTATS 2022 paper acceptance results are supposed to be released tomorrow. Creating a discussion thread for this year's results. submitted by /u/zy415 [link] [comments]
    [Discussion] How do you organize a data science team process?
    (2 min) Working in R&D can sometimes be a pain in the a*, especially when designing a process that would fit the whole team. I've heard about and have worked on projects where there were no daily meetings, no discussion about technology across the team, and it was kind of free-form till the due date. Enforcing rules on data scientists often backfires with their productivity too. How do you tackle such a problem in strictly research & development areas? submitted by /u/InGraphsWeTrust [link] [comments]
    [R] CoAtNet: Marrying Convolution and Attention for All Data Sizes (video explanation)
    (0 min) Here is a YouTube video explaining the best image classification model available today, namely CoAtNet. Hope it's useful: https://youtu.be/VoRQiKQcdcI submitted by /u/Combination-Fun [link] [comments]
    How useful is knowledge of parallel programming in ML? [D]
    (3 min) I'm an aspiring ML engineer. Wondering how useful or applicable the knowledge of parallel computation is in the world of AI/ML? submitted by /u/mha_1992 [link] [comments]
    [D] Joshua Tenenbaum's opinions on Neural Generative Models
    (2 min) TL;DR Hail Mary Request: Does anyone know what Josh thinks of VAEs, GANs and other neural generative models, or maybe have a link to a source in which he describes his opinions on them? Hey all, I've recently been doing some literature review on Probabilistic Programming and have been tearing through Josh Tenenbaum's bibliography and watching several presentations of his. During this, I (maybe) came across a presentation on YouTube in which he mentioned his opinion of the usefulness of VAEs and GANs as generative models. Something along the lines of "They're a step in the right direction (Generative Modelling) but don't work as GMs nearly as well as probabilistic programs". Unfortunately I've totally lost track of the source of this throwaway insight and can't even be sure whether it was from a paper, from some other Probabilistic Programming researcher, or an opinion of his I imagined that just happens to fit my model of him (Pun Intended). I've already spent the morning trying to track down the source of this supposed opinion of his, so I decided to just try asking this subreddit on the off chance anyone knows what he thinks and especially where he shared it. Thanks for your time, A Certified Money License Issuer submitted by /u/MoneyLicense [link] [comments]
    [P] Cherche - allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers.
    (0 min) submitted by /u/binaryfor [link] [comments]
    [P] Open-source library to process unstructured data for ML tasks
    (1 min) Excited to share DocArray, an open-source Python library to store and process unstructured data such as text, image, audio, video, or 3D mesh. Useful in processing data for ML tasks such as embed, search, recommend, etc. DocArray aims to be the data structure for unstructured data. It consists of two simple concepts: (1) Document, a data structure for easily representing nested, unstructured data; and (2) DocumentArray, a container for efficiently accessing, manipulating, and understanding multiple Documents. Why did I build it? While working on Jina (an AI-powered search framework), I needed a way to store and process large amounts of unstructured data for the purpose of creating embeddings and building search on top of them. I tried solutions such as JSON, numpy.ndarray, pandas.DataFrame, Protobuf, etc., but they were not suitable for our computation-intensive tasks on unstructured and nested data. Ask me questions if you need more info on this. Check out the GitHub repository for examples. This project has been used and tested well in my other project (Jina), but there's a lot of scope for making it more useful for the community. Ask me your questions and share your suggestions. submitted by /u/opensourcecolumbus [link] [comments]
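    A hedged taste of the two concepts (based on DocArray's API around this time; exact signatures may differ between versions):

      import numpy as np
      from docarray import Document, DocumentArray

      docs = DocumentArray([
          Document(text='hello, world'),
          Document(uri='https://example.com/cat.png'),   # an image, held by reference
      ])
      docs[0].embedding = np.array([0.1, 0.2, 0.3])      # attach an embedding to a Document
      print(len(docs), docs[0].text)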
  • Reinforcement Learning

    How does OpenAI PPO Parallelism work?
    (1 min) I've already seen this post and this one, but some of the details are still unclear to me. In particular, I'm wondering how this parallelism differs from 'batched environment' parallelism. Is it at all unique to PPO? Also, how scalable is this? It seems like they are just increasing the batch size enormously (to like 1m examples); does this ever face diminishing returns? submitted by /u/Udon_noodles [link] [comments]
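    For comparison, plain 'batched environment' parallelism looks like this with gym's vectorized API of the time (nothing PPO-specific; the random actions stand in for a policy):

      import gym
      import numpy as np

      # 8 copies of the environment stepped in lockstep; observations,
      # rewards, and dones come back batched along the first axis.
      envs = gym.vector.make("CartPole-v1", num_envs=8)
      obs = envs.reset()
      for _ in range(100):
          actions = np.random.randint(0, 2, size=8)   # a policy would go here
          obs, rewards, dones, infos = envs.step(actions)

    On the scalability question: OpenAI's large-batch study ("An Empirical Model of Large-Batch Training", McCandlish et al. 2018) suggests returns do diminish once the batch size exceeds the task's gradient noise scale.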
    Selecting agent actions (Beginner Question)
    (1 min) I'd like to think I have a basic amount of high-level understanding of RL when asking this question. What would be the normal approach when writing an agent that selects an action: ensuring the agent selects only from the valid actions, or letting it select any action and just penalizing it (while retaining the state) if the action is illegal? submitted by /u/DisastrousTrainer322 [link] [comments]
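    Both options show up in practice, but a common third choice is to mask illegal actions out of the distribution before sampling, so the agent can never pick them and no penalty shaping is needed. A NumPy sketch (the logits and mask are made up):

      import numpy as np

      def sample_masked(logits, legal_mask, rng=np.random.default_rng()):
          """Set illegal actions' logits to -inf so they get probability 0."""
          masked = np.where(legal_mask, logits, -np.inf)
          z = np.exp(masked - masked[legal_mask].max())   # stable softmax
          probs = z / z.sum()
          return rng.choice(len(logits), p=probs)

      logits = np.array([1.0, 0.3, -0.5, 2.0])
      legal  = np.array([True, False, True, True])        # action 1 is illegal here
      action = sample_masked(logits, legal)               # never returns 1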
    "Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics", Swayamdipta et al 2020
    (1 min) submitted by /u/gwern [link] [comments]

2022-01-16

  • Artificial Intelligence

    What are the best AI face editing apps/software you know?
    (0 min) Question: What is the best AI that can turn you into an anime character? submitted by /u/xXNOdrugsForMEXx [link] [comments]
    Most exciting Recent or Upcoming AI Tech?
    (1 min) I'm looking for a topic to do a research AI white paper on and I was curious if anyone had any interesting ideas. It does not have to be the most complex thing in the world. It's basically just: the use of an AI / machine learning solution to solve a specific problem in a business. A good example: something like an AI chatbot replacing HR employee engagement surveys. submitted by /u/kmalternative [link] [comments]
    The Messy History of Facial Recognition Company Clearview AI
    (0 min) submitted by /u/regalalgorithm [link] [comments]
    There is something about AI-generated architecture... it always comes out mind-blowing
    (1 min) submitted by /u/snowpixelapp [link] [comments]
    I'm looking to join a project/research effort on 3D machine learning from abstract 2D art, producing several 3D-modeled art outputs from a single non-figurative 2D image.
    (1 min) The applications for this could disrupt the architecture & sculpture worlds. Which companies/startups/research teams could be interested in this research PC app project? I'm really interested in being a part of a project like this. Please message me if you want to start. I have more specs. submitted by /u/Niu_Davinci [link] [comments]
    Isn't this table for a perceptron of the AND gate wrong for B? Because we use the updated bias and not the original bias? Please clear up my confusion
    (1 min) submitted by /u/barnyard9 [link] [comments]
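    Without seeing the table itself, the convention most worth checking is this: the classic perceptron rule is applied online, so each row of a hand-worked table uses the weights and bias as already updated by the previous row, not the original ones. A runnable sketch for the AND gate (step activation, learning rate 1, and zero initialization are assumptions):

      # Each sample uses the weights/bias *after* the previous sample's update.
      data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
      w, b, lr = [0.0, 0.0], 0.0, 1.0

      for epoch in range(10):
          for (x1, x2), target in data:
              y = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
              err = target - y
              w[0] += lr * err * x1
              w[1] += lr * err * x2
              b    += lr * err      # the updated bias carries into the next row

      print(w, b)   # converges to an AND separator, e.g. [2.0, 1.0] and b = -2.0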
    Is Machine Learning Research Oversaturated?
    (0 min) submitted by /u/SlickBlueML [link] [comments]
  • Machine Learning

    [P] Discord Community for game developers using ML models/GPT3 in their games
    (1 min) submitted by /u/22vortex22 [link] [comments]
    [R] AIoT and Federated Learning Platform
    (1 min) Are there any AIoT or Federated learning platforms that are recommended to help write and test/train models for edge devices? Looking for a platform that allows me to write a jupyter notebook, and simulate an environment of N edge devices - each device with unique data. Any other pointers, I'm all ears! submitted by /u/morseky1 [link] [comments]
    [Discussion] A ML-based intelligence model
    (1 min) To my current knowledge, the basis for IQ tests is the g-factor model developed over a century ago by Spearman, using factor analysis. A general intelligence factor has been generally admitted, and a lot of factor analysis studies have been done to estimate such a factor and correlations, etc. More recently, the consensual idea has been of a hierarchical model: the g-factor (with no consensus on its interpretation) at the top, broad abilities below, and specific abilities below that. Now, I didn't find any attempt to create a general intelligence model from more modern tools. What about all the non-linear dimensionality reduction techniques (kernel-based, manifold learning, ...) or deep learning models, etc.? submitted by /u/hugoboum [link] [comments]
    [D] Machine Learning - WAYR (What Are You Reading) - Week 129
    (1 min) This is a place to share machine learning research papers, journals, and articles that you're reading this week. If it relates to what you're researching, by all means elaborate and give us your insight; otherwise it could just be an interesting paper you've read. Please try to provide some insight from your understanding and please don't post things which are present in the wiki. Preferably you should link the arXiv page (not the PDF; you can easily access the PDF from the summary page but not the other way around) or any other pertinent links. Previous weeks: 1-10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81-90, 91-100, 101-110, 111-120, 121-130.…
    [P] Feature extraction for acoustic signals
    (2 min) Wondering if anyone here has any experience with feature extraction for acoustic signals, specifically broadband noises like gunshots? I'm an ecologist by training and have very little experience with ML, but I have gone through the literature some and found methods others have used on acoustic data, like extracting mel-frequency cepstrum coefficients. But I am a bit lost on what exactly is being done here, so I'd be grateful if someone is willing to explain this and/or suggest other methods that might be better suited for extracting relevant features from broadband signals. Thank you!!! submitted by /u/Life-Volume135 [link] [comments]
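    For what it's worth, the MFCC step usually boils down to a couple of lines (a sketch with librosa; the filename and the clip-level pooling are assumptions):

      import numpy as np
      import librosa

      y, sr = librosa.load("gunshot.wav", sr=22050)        # hypothetical clip
      # 13 mel-frequency cepstral coefficients per short frame: a compact
      # summary of the spectral envelope in each ~23 ms window.
      mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, num_frames)
      clip_vector = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    Since gunshots are broadband and impulsive, frame-level features such as spectral flatness or zero-crossing rate may complement MFCCs, which were designed with speech-like spectral envelopes in mind.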
    [R] The Annotated S4: Efficiently Modeling Long Sequences with Structured State Spaces
    (1 min) Blog post: https://srush.github.io/annotated-s4/ Original paper: https://arxiv.org/abs/2111.00396 This is a blog post that introduces a recent model called S4: (Structured State Space for Sequence Modeling). It is somewhat different from most previous architectures such as CNNs, Transformers, etc, and this work seeks to explain the model using code examples and figures. Here's the first paragraph of the post: The Structured State Space for Sequence Modeling (S4) architecture is a new approach to very long-range sequence modeling tasks for vision, language, and audio, showing a capacity to capture dependencies over tens of thousands of steps. Especially impressive are the model’s results on the challenging Long Range Arena benchmark, showing an ability to reason over sequences of up to 16,000+ elements with high accuracy. Rosanne Liu (one of the co-founders of ML Collective) considers the paper as one of the most underrated papers of 2021, and labeled the Annotated S4 blog post as a "must-read" (tweet). submitted by /u/jwuphysics [link] [comments]
    [R] Instant Neural Graphics Primitives with a Multiresolution Hash Encoding (Training a NeRF takes 5 seconds!)
    (3 min) submitted by /u/Illustrious_Row_9971 [link] [comments]
    [D] Simple Questions Thread
    (2 min) Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]
    [P] I'm building a Neural Search Plugin for Elastic/Opensearch
    (2 min) Hey everyone, I'm building a plugin for end-to-end neural search (e.g. SBERT, Hugging Face, Doc2Vec, etc.) in Opensearch and I'd love to hear any suggestions from the community about what you might find useful, or about issues you've had with Opensearch/neural search in the past. Despite the Elasticsearch website claiming in many places that you can do "machine learning" with Elasticsearch, I've found that it's not straightforward at all to use neural search algos with ES/Opensearch. In most cases (putting to one side some specific cases like anomaly detection), you have to implement the ML algorithm yourself and you only get to use ES as a storage layer. Some 3rd-party plug-and-play frameworks that support ES seem quite promising but also lack functionality in terms of data retrieval - for exa…
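    For context on the model side of such a pipeline (a sketch with the sentence-transformers library; the checkpoint name and toy corpus are arbitrary, and in the plugin these vectors would be stored in ES/Opensearch rather than in memory):

      from sentence_transformers import SentenceTransformer, util

      model = SentenceTransformer("all-MiniLM-L6-v2")      # a common SBERT checkpoint
      corpus = ["how to reset my password", "pricing for the pro plan"]
      corpus_emb = model.encode(corpus, convert_to_tensor=True)

      query_emb = model.encode("forgot login credentials", convert_to_tensor=True)
      hits = util.semantic_search(query_emb, corpus_emb, top_k=1)
      print(corpus[hits[0][0]["corpus_id"]])               # -> the password entry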
    [D] Best Object Segmentation Repo with Temporal Dependency?
    (1 min) submitted by /u/ZeApelido [link] [comments]
    [P] Ornette: ML-based Music Generation System
    (1 min) Hi all. I'm developing a TUI tool for continuous music generation with machine learning. You can pick a machine learning model and generate sound with it via SuperCollider, or save its output as MIDI and render the audio yourself later. You can also load a MIDI clip to use as a primer. I'll be happy to hear opinions and answer questions. Cheers! (video demo, GitHub) submitted by /u/ghalestrilo [link] [comments]
    [D] Generic version of Numerai (with non-trading challenges)?
    (0 min) Does anyone know if there exists a generic version of Numerai? What I mean is a platform where, just like on Numerai, you can get paid as a data scientist for solving challenges, but where challenges are more broad and not just trading? submitted by /u/xepo3abp [link] [comments]
    [P] Open-source tool for building NLP training sets with weak supervision and search queries
    (1 min) submitted by /u/dvilasuero [link] [comments]
    [D] What are the popular AI/ML Conferences?
    (2 min) We created a page that organizes papers from popular conferences. Which conferences have we missed? Link: https://papers.labml.ai/conferences We collected papers from ICLR, NeurIPS, ICCV, CVPR, ICML, IJCAI, EMNLP, ACL and AAAI. ICLR 2022 has papers from the review stage. PS: Since most AI paper preprints are readily available on arXiv now, what effect do you think it has if the paper is published in a popular conference? submitted by /u/hnipun [link] [comments]
    [R] Consistent Style Transfer
    (0 min) submitted by /u/Illustrious_Row_9971 [link] [comments]
    [P] Getting 3D sneaker models from a single 2D sideview
    (1 min) submitted by /u/InfamousPancakes [link] [comments]
    [D] Why isn't crowdsourced ML more popular?
    (3 min) I'm wondering why projects like OpenML haven't really taken off. The idea behind it is that people in science who have datasets they'd like modeled can upload their data, set tasks or questions they want to answer, and then let people who want to hone their ML skills attempt to create predictive models from the data. It was launched in 2014, but hasn't really gained any traction in the ML and data science world. Do people here have any thoughts on why crowdsourced ML hasn't caught on? submitted by /u/EducationalCicada [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Wine Quality Prediction with Sequential and Decision Tree models | Kolbenkraft
    (0 min) submitted by /u/cjmodi306 [link] [comments]
    I'm building a plugin for neural search in ElasticSearch/Opensearch
    (1 min) Hey Everyone, I'm building a plugin for end-to-end neural search (eg SBERT, Hugging Face, Doc2Vec etc) in Opensearch and it would be great to hear any suggestions from the NLP community of what you might find useful or about issues you've had with Opensearch/neural search in the past. Ideally, the plugin would allow the user to configure/train an algorithm and then it would allow them to work entirely with text, using the vector representations in the background. I'm also exploring extending this and allowing users to refresh/update these embeddings to continually improve them. Let me know what you think, open to any suggestions! If you want to keep up to date with this, here is a google form https://forms.gle/acmGTK1gPkPZbVJm8 submitted by /u/everythingserverless [link] [comments]
    Isn't this table for a perceptron of the AND gate wrong for B? Because we use the updated bias and not the original bias? Please clear up my confusion
    (1 min) submitted by /u/barnyard9 [link] [comments]
  • Reinforcement Learning

    Are multi agent or self-play environments always automatically POMDPs?
    (1 min) Let's say we look at the game of Atari Pong. In DeepMind's Atari paper from 2015, the state was represented by a stack of 4 frames, so it would have the Markov property regarding the ball speed and direction. I have a similar environment to Pong and wanted to perform self-play to improve the wall-clock training time of my agent. I considered Pong with a state representation as described above to be an MDP. But with another agent, I'm not so sure. I mean, the strategies and intentions of the opponent are not encoded in the observation. Does this make the example a POMDP? And wouldn't this mean that most multi-agent environments are POMDPs, even though the game is a perfect information game? submitted by /u/Educational_Dot_3024 [link] [comments]
    Clarification regarding advantage calculation in PPO
    (1 min) Hello :D, Regarding the advantage calculation (in GAE for example), what happens if multiple episodes finish within the same batch? I've seen works that mask out everything after an episode terminates, but I'm not sure if they keep all finished episodes or keep only one finished episode, in case the batch has multiple finished episodes. submitted by /u/AhmedNizam_ [link] [comments]
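    What most implementations I have seen do (a sketch, not a claim about any particular codebase): keep every finished episode in the batch and let per-step done flags cut both the bootstrap and the recursion, so advantages never leak across episode boundaries:

      import numpy as np

      def gae(rewards, values, dones, gamma=0.99, lam=0.95):
          """GAE over a rollout that may span several episodes. dones[t] = 1
          zeroes the bootstrap at step t; values has one extra entry for the
          state after the final step."""
          T = len(rewards)
          adv, last = np.zeros(T), 0.0
          for t in reversed(range(T)):
              nonterminal = 1.0 - dones[t]
              delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
              last = delta + gamma * lam * nonterminal * last
              adv[t] = last
          return adv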
    Please give me a tip on training InverseRL (like GAIL)
    (1 min) Hi, I'm trying to train some agents in a tabular POMDP environment using expert trajectories. (I know it's not the best way to solve this environment.) In my case, not only the agent but also the discriminator gets limited observations. First, behavior cloning showed acceptable performance, but GAIL with PPO (+ a cross-entropy term) showed very unstable learning: the agent got a high value estimate, but the accumulated real rewards were very small. I tried changing the network capacity, changing the activation function to tanh, and lowering the agent's learning rate. Please give me a tip for training the GAIL discriminator! Many thanks! (The training result image did not upload, so please click this link.) submitted by /u/Spiritual_Fig3632 [link] [comments]

2022-01-15

  • cs.LG updates on arXiv.org

    An adaptable cognitive microcontroller node for fitness activity recognition. (arXiv:2201.05110v1 [cs.LG])
    (2 min) The new generation of wireless technologies, fitness trackers, and devices with embedded sensors can have a big impact on healthcare systems and quality of life. Among the most crucial aspects to consider in these devices are the accuracy of the data produced and power consumption. Many of the events that can be monitored, while apparently simple, may not be easily detectable and recognizable by devices equipped with embedded sensors, especially on devices with low computing capabilities. It is well known that deep learning reduces the study of features that contribute to the recognition of the different target classes. In this work, we present a portable and battery-powered microcontroller-based device applicable to a wobble board. Wobble boards are low-cost equipment that can be used for sensorimotor training to avoid ankle injuries or as part of the rehabilitation process after an injury. The exercise recognition process was implemented through the use of cognitive techniques based on deep learning. To reduce power consumption, we add an adaptivity layer that dynamically manages the device's hardware and software configuration to adapt it to the required operating mode at runtime. Our experimental results show that adjusting the node configuration to the workload at runtime can save up to 60% of the power consumed. On a custom dataset, our optimized and quantized neural network achieves an accuracy value greater than 97% for detecting some specific physical exercises on a wobble board.
    Outcome-Driven Dynamic Refugee Assignment with Allocation Balancing. (arXiv:2007.03069v4 [math.OC] UPDATED)
    (2 min) This study proposes two new dynamic assignment algorithms to match refugees and asylum seekers to geographic localities within a host country. The first, currently implemented in a multi-year pilot in Switzerland, seeks to maximize the average predicted employment level (or any measured outcome of interest) of refugees through a minimum-discord online assignment algorithm. Although the proposed algorithm achieves near-optimal expected employment compared to the hindsight-optimal solution (and improves upon the status quo procedure by about 40%), it results in a periodically imbalanced allocation to the localities over time. This leads to undesirable workload inefficiencies for resettlement resources and agents. To address this problem, the second algorithm balances the goal of improving refugee outcomes with the desire for an even allocation over time. The performance of the proposed methods is illustrated using real refugee resettlement data from a large resettlement agency in the United States. On this dataset, we find that the allocation balancing algorithm can achieve near-perfect balance over time with only a small loss in expected employment compared to the pure employment-maximizing algorithm. In addition, the allocation balancing algorithm offers a number of ancillary benefits compared to pure outcome-maximization, including robustness to unknown arrival flows and greater exploration.
    DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism. (arXiv:2105.02446v5 [eess.AS] UPDATED)
    (2 min) Singing voice synthesis (SVS) systems are built to synthesize high-quality and expressive singing voice, in which the acoustic model generates the acoustic features (e.g., mel-spectrogram) given a music score. Previous singing acoustic models adopt a simple loss (e.g., L1 and L2) or generative adversarial network (GAN) to reconstruct the acoustic features, while they suffer from over-smoothing and unstable training issues respectively, which hinder the naturalness of synthesized singing. In this work, we propose DiffSinger, an acoustic model for SVS based on the diffusion probabilistic model. DiffSinger is a parameterized Markov chain that iteratively converts the noise into mel-spectrogram conditioned on the music score. By implicitly optimizing variational bound, DiffSinger can be stably trained and generate realistic outputs. To further improve the voice quality and speed up inference, we introduce a shallow diffusion mechanism to make better use of the prior knowledge learned by the simple loss. Specifically, DiffSinger starts generation at a shallow step smaller than the total number of diffusion steps, according to the intersection of the diffusion trajectories of the ground-truth mel-spectrogram and the one predicted by a simple mel-spectrogram decoder. Besides, we propose boundary prediction methods to locate the intersection and determine the shallow step adaptively. The evaluations conducted on a Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work. Extensional experiments also prove the generalization of our methods on text-to-speech task (DiffSpeech). Audio samples: https://diffsinger.github.io. Codes: https://github.com/MoonInTheRiver/DiffSinger.
    Learning to Break Deep Perceptual Hashing: The Use Case NeuralHash. (arXiv:2111.06628v3 [cs.LG] UPDATED)
    (2 min) Apple recently revealed its deep perceptual hashing system NeuralHash to detect child sexual abuse material (CSAM) on user devices before files are uploaded to its iCloud service. Public criticism quickly arose regarding the protection of user privacy and the system's reliability. In this paper, we present the first comprehensive empirical analysis of deep perceptual hashing based on NeuralHash. Specifically, we show that current deep perceptual hashing may not be robust. An adversary can manipulate the hash values by applying slight changes in images, either induced by gradient-based approaches or simply by performing standard image transformations, forcing or preventing hash collisions. Such attacks permit malicious actors easily to exploit the detection system: from hiding abusive material to framing innocent users, everything is possible. Moreover, using the hash values, inferences can still be made about the data stored on user devices. In our view, based on our results, deep perceptual hashing in its current form is generally not ready for robust client-side scanning and should not be used from a privacy perspective.
    Deep clustering with fusion autoencoder. (arXiv:2201.04727v1 [cs.LG])
    (2 min) Embracing the deep learning techniques for representation learning in clustering research has attracted broad attention in recent years, yielding a newly developed clustering paradigm, viz. the deep clustering (DC). Typically, the DC models capitalize on autoencoders to learn the intrinsic features which facilitate the clustering process in consequence. Nowadays, a generative model named variational autoencoder (VAE) has got wide acceptance in DC studies. Nevertheless, the plain VAE is insufficient to perceive the comprehensive latent features, leading to degraded clustering performance. In this paper, a novel DC method is proposed to address this issue. Specifically, the generative adversarial network and VAE are coalesced into a new autoencoder called fusion autoencoder (FAE) for discerning more discriminative representation that benefits the downstream clustering task. Besides, the FAE is implemented with the deep residual network architecture which further enhances the representation learning ability. Finally, the latent space of the FAE is transformed to an embedding space shaped by a deep dense neural network for pulling away different clusters from each other and collapsing data points within individual clusters. Experiments conducted on several image datasets demonstrate the effectiveness of the proposed DC model against the baseline methods.
    Meta-Learning Guarantees for Online Receding Horizon Learning Control. (arXiv:2010.11327v14 [eess.SY] UPDATED)
    (2 min) In this paper we provide provable regret guarantees for an online meta-learning receding horizon control algorithm in an iterative control setting. We consider the setting where, in each iteration the system to be controlled is a linear deterministic system that is different and unknown, the cost for the controller in an iteration is a general additive cost function and there are affine control input constraints. By analysing conditions under which sub-linear regret is achievable, we prove that the meta-learning online receding horizon controller achieves an average of the dynamic regret for the controller cost that is $\tilde{O}((1+1/\sqrt{N})T^{3/4})$ with the number of iterations $N$. Thus, we show that the worst regret for learning within an iteration improves with experience of more iterations, with guarantee on rate of improvement.
    Modeling Implicit Bias with Fuzzy Cognitive Maps. (arXiv:2112.12713v2 [cs.LG] UPDATED)
    (2 min) This paper presents a Fuzzy Cognitive Map model to quantify implicit bias in structured datasets where features can be numeric or discrete. In our proposal, problem features are mapped to neural concepts that are initially activated by experts when running what-if simulations, whereas weights connecting the neural concepts represent absolute correlation/association patterns between features. In addition, we introduce a new reasoning mechanism equipped with a normalization-like transfer function that prevents neurons from saturating. Another advantage of this new reasoning mechanism is that it can easily be controlled by regulating nonlinearity when updating neurons' activation values in each iteration. Finally, we study the convergence of our model and derive analytical conditions concerning the existence and unicity of fixed-point attractors.
    Auto IV: Counterfactual Prediction via Automatic Instrumental Variable Decomposition. (arXiv:2107.05884v2 [cs.LG] UPDATED)
    (2 min) Instrumental variables (IVs), sources of treatment randomization that are conditionally independent of the outcome, play an important role in causal inference with unobserved confounders. However, the existing IV-based counterfactual prediction methods need well-predefined IVs, while it is an art rather than science to find valid IVs in many real-world scenes. Moreover, the predefined hand-made IVs could be weak or erroneous by violating the conditions of valid IVs. These thorny facts hinder the application of the IV-based counterfactual prediction methods. In this paper, we propose a novel Automatic Instrumental Variable decomposition (AutoIV) algorithm to automatically generate representations serving the role of IVs from observed variables (IV candidates). Specifically, we let the learned IV representations satisfy the relevance condition with the treatment and exclusion condition with the outcome via mutual information maximization and minimization constraints, respectively. We also learn confounder representations by encouraging them to be relevant to both the treatment and the outcome. The IV and confounder representations compete for the information with their constraints in an adversarial game, which allows us to get valid IV representations for IV-based counterfactual prediction. Extensive experiments demonstrate that our method generates valid IV representations for accurate IV-based counterfactual prediction.
    Flexible Networks for Learning Physical Dynamics of Deformable Objects. (arXiv:2112.03728v2 [cs.CV] UPDATED)
    (2 min) Learning the physical dynamics of deformable objects with particle-based representation has been the objective of many computational models in machine learning. While several state-of-the-art models have achieved this objective in simulated environments, most existing models impose a precondition, such that the input is a sequence of ordered point sets. That is, the order of the points in each point set must be the same across the entire input sequence. This precondition restrains the model from generalizing to real-world data, which is considered to be a sequence of unordered point sets. In this paper, we propose a model named time-wise PointNet (TP-Net) that solves this problem by directly consuming a sequence of unordered point sets to infer the future state of a deformable object with particle-based representation. Our model consists of a shared feature extractor that extracts global features from each input point set in parallel and a prediction network that aggregates and reasons on these features for future prediction. The key concept of our approach is that we use global features rather than local features to achieve invariance to input permutations and ensure the stability and scalability of our model. Experiments demonstrate that our model achieves state-of-the-art performance with real-time prediction speed in both synthetic dataset and real-world dataset. In addition, we provide quantitative and qualitative analysis on why our approach is more effective and efficient than existing approaches.
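    The permutation-invariance trick the abstract alludes to (a shared per-point feature extractor followed by symmetric global pooling) is easy to sketch in PyTorch; this is the generic PointNet-style pattern, not the paper's exact TP-Net:

      import torch
      import torch.nn as nn

      class GlobalPointFeatures(nn.Module):
          """Apply the same MLP to every point, then max-pool across points;
          the output is identical under any permutation of the input points."""
          def __init__(self, d_in=3, d_feat=128):
              super().__init__()
              self.mlp = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(),
                                       nn.Linear(64, d_feat))
          def forward(self, pts):                      # pts: (batch, n_points, d_in)
              return self.mlp(pts).max(dim=1).values   # -> (batch, d_feat)

      x = torch.randn(2, 500, 3)
      f = GlobalPointFeatures()
      assert torch.allclose(f(x), f(x[:, torch.randperm(500), :]))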
    DebiasedDTA: Model Debiasing to Boost Drug-Target Affinity Prediction. (arXiv:2107.05556v3 [q-bio.QM] UPDATED)
    (2 min) Motivation: Computational models that accurately identify high-affinity protein-chemical pairs can accelerate drug discovery pipelines. These models, trained on available protein-chemical interaction datasets, can be used to predict the binding affinity of an input protein-chemical pair. However, the training datasets may contain surface patterns, called dataset biases, which cause models to memorize dataset-specific biomolecule properties, instead of learning binding mechanisms. As a result, the prediction performance of models drops for unseen biomolecules. Here, we present DebiasedDTA, a novel drug-target affinity (DTA) prediction model training framework that addresses dataset biases to improve affinity prediction for novel biomolecules. DebiasedDTA uses ensemble learning and sample weight adaptation to identify and avoid biases and is applicable to most DTA prediction models. Results: The results show that DebiasedDTA can boost models while predicting the interactions between unseen biomolecules. In addition, prediction performance for seen biomolecules also improves. The experiments also show that DebiasedDTA can augment DTA prediction models of different input and model structures and is able to avoid biases of different sources. The investigations of predictions reveal that model debiasing can diminish the importance of misleading features and can enable models to learn more from the proteins. DebiasedDTA is published as an open-source python package to enable debiasing custom DTA prediction models with only two lines of code.
    DS-Sync: Addressing Network Bottlenecks with Divide-and-Shuffle Synchronization for Distributed DNN Training. (arXiv:2007.03298v2 [cs.LG] UPDATED)
    (2 min) Bulk synchronous parallel (BSP) is the de-facto paradigm for distributed DNN training in today's production clusters. However, due to the global synchronization nature, its performance can be significantly influenced by network bottlenecks caused by either static topology heterogeneity or dynamic bandwidth contentions. Existing solutions, either system-level optimizations strengthening BSP (e.g., Ring or Hierarchical All-reduce) or algorithmic optimizations replacing BSP (e.g., ASP or SSP, which relax the global barriers), do not completely solve the problem, as they may still suffer from communication inefficiency or risk convergence inaccuracy. In this paper, we present a novel divide-and-shuffle synchronization (DS-Sync) to realize communication efficiency without sacrificing convergence accuracy for distributed DNN training. At its heart, by taking into account the network bottlenecks, DS-Sync improves communication efficiency by dividing workers into non-overlap groups to synchronize independently in a bottleneck-free manner. Meanwhile, it maintains convergence accuracy by iteratively shuffling workers among different groups to ensure a global consensus. We theoretically prove that DS-Sync converges properly in non-convex and smooth conditions like DNN. We further implement DS-Sync and integrate it with PyTorch, and our testbed experiments show that DS-Sync can achieve up to $94\%$ improvements on the end-to-end training time with existing solutions while maintaining the same accuracy.
    Multimodal Representation for Neural Code Search. (arXiv:2107.00992v3 [cs.SE] UPDATED)
    (2 min) Semantic code search is about finding semantically relevant code snippets for a given natural language query. In the state-of-the-art approaches, the semantic similarity between code and query is quantified as the distance of their representation in the shared vector space. In this paper, to improve the vector space, we introduce tree-serialization methods on a simplified form of AST and build the multimodal representation for the code data. We conduct extensive experiments using a single corpus that is large-scale and multi-language: CodeSearchNet. Our results show that both our tree-serialized representations and multimodal learning model improve the performance of code search. Last, we define intuitive quantification metrics oriented to the completeness of semantic and syntactic information of the code data, to help understand the experimental findings.
    NURBS-Diff: A Differentiable Programming Module for NURBS. (arXiv:2104.14547v4 [cs.LG] UPDATED)
    (2 min) Boundary representations (B-reps) using Non-Uniform Rational B-splines (NURBS) are the de facto standard used in CAD, but their utility in deep learning-based approaches is not well researched. We propose a differentiable NURBS module to integrate NURBS representations of CAD models with deep learning methods. We mathematically define the derivatives of the NURBS curves or surfaces with respect to the input parameters (control points, weights, and the knot vector). These derivatives are used to define an approximate Jacobian used for performing the "backward" evaluation to train the deep learning models. We have implemented our NURBS module using GPU-accelerated algorithms and integrated it with PyTorch, a popular deep learning framework. We demonstrate the efficacy of our NURBS module in performing CAD operations such as curve or surface fitting and surface offsetting. Further, we show its utility in deep learning for unsupervised point cloud reconstruction and enforce analysis constraints. These examples show that our module performs better for certain deep learning frameworks and can be directly integrated with any deep-learning framework requiring NURBS.
    SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets. (arXiv:2111.06467v2 [cs.CL] UPDATED)
    (2 min) NLP researchers need more, higher-quality text datasets. Human-labeled datasets are expensive to collect, while datasets collected via automatic retrieval from the web such as WikiBio are noisy and can include undesired biases. Moreover, data sourced from the web is often included in datasets used to pretrain models, leading to inadvertent cross-contamination of training and test sets. In this work we introduce a novel method for efficient dataset curation: we use a large language model to provide seed generations to human raters, thereby changing dataset authoring from a writing task to an editing task. We use our method to curate SynthBio - a new evaluation set for WikiBio - composed of structured attribute lists describing fictional individuals, mapped to natural language biographies. We show that our dataset of fictional biographies is less noisy than WikiBio, and also more balanced with respect to gender and nationality.
    A Comparative Study on Basic Elements of Deep Learning Models for Spatial-Temporal Traffic Forecasting. (arXiv:2111.07513v2 [cs.LG] UPDATED)
    (2 min) Traffic forecasting plays a crucial role in intelligent transportation systems. The spatial-temporal complexities in transportation networks make the problem especially challenging. The recently suggested deep learning models share basic elements such as graph convolution, graph attention, recurrent units, and/or attention mechanism. In this study, we designed an in-depth comparative study for four deep neural network models utilizing different basic elements. For base models, one RNN-based model and one attention-based model were chosen from previous literature. Then, the spatial feature extraction layers in the models were substituted with graph convolution and graph attention. To analyze the performance of each element in various environments, we conducted experiments on four real-world datasets - highway speed, highway flow, urban speed from a homogeneous road link network, and urban speed from a heterogeneous road link network. The results demonstrate that the RNN-based model and the attention-based model show a similar level of performance for short-term prediction, and the attention-based model outperforms the RNN in longer-term predictions. The choice of graph convolution and graph attention makes a larger difference in the RNN-based models. Also, our modified version of GMAN shows comparable performance with the original with less memory consumption.
    On component interactions in two-stage recommender systems. (arXiv:2106.14979v3 [cs.IR] UPDATED)
    (2 min) Thanks to their scalability, two-stage recommenders are used by many of today's largest online platforms, including YouTube, LinkedIn, and Pinterest. These systems produce recommendations in two steps: (i) multiple nominators, tuned for low prediction latency, preselect a small subset of candidates from the whole item pool; (ii) a slower but more accurate ranker further narrows down the nominated items, and serves to the user. Despite their popularity, the literature on two-stage recommenders is relatively scarce, and the algorithms are often treated as mere sums of their parts. Such treatment presupposes that the two-stage performance is explained by the behavior of the individual components in isolation. This is not the case: using synthetic and real-world data, we demonstrate that interactions between the ranker and the nominators substantially affect the overall performance. Motivated by these findings, we derive a generalization lower bound which shows that independent nominator training can lead to performance on par with uniformly random recommendations. We find that careful design of item pools, each assigned to a different nominator, alleviates these issues. As manual search for a good pool allocation is difficult, we propose to learn one instead using a Mixture-of-Experts based approach. This significantly improves both precision and recall at K.
    A Concentration Bound for LSPE($\lambda$). (arXiv:2111.02644v2 [cs.LG] UPDATED)
    (2 min) The popular LSPE($\lambda$) algorithm for policy evaluation is revisited to derive a concentration bound that gives high probability performance guarantees from some time on.
    Discovering Governing Equations from Partial Measurements with Deep Delay Autoencoders. (arXiv:2201.05136v1 [cs.LG])
    (2 min) A central challenge in data-driven model discovery is the presence of hidden, or latent, variables that are not directly measured but are dynamically important. Takens' theorem provides conditions for when it is possible to augment these partial measurements with time delayed information, resulting in an attractor that is diffeomorphic to that of the original full-state system. However, the coordinate transformation back to the original attractor is typically unknown, and learning the dynamics in the embedding space has remained an open challenge for decades. Here, we design a custom deep autoencoder network to learn a coordinate transformation from the delay embedded space into a new space where it is possible to represent the dynamics in a sparse, closed form. We demonstrate this approach on the Lorenz, R\"ossler, and Lotka-Volterra systems, learning dynamics from a single measurement variable. As a challenging example, we learn a Lorenz analogue from a single scalar variable extracted from a video of a chaotic waterwheel experiment. The resulting modeling framework combines deep learning to uncover effective coordinates and the sparse identification of nonlinear dynamics (SINDy) for interpretable modeling. Thus, we show that it is possible to simultaneously learn a closed-form model and the associated coordinate system for partially observed dynamics.
    Periodic Freight Demand Estimation for Large-scale Tactical Planning. (arXiv:2105.09136v3 [cs.LG] UPDATED)
    (2 min) Freight carriers rely on tactical planning to design their service network to satisfy demand in a cost-effective way. For computational tractability, deterministic and cyclic Service Network Design (SND) formulations are used to solve large-scale problems. A central input is the periodic demand, that is, the demand expected to repeat in every period in the planning horizon. In practice, demand is predicted by a time series forecasting model and the periodic demand is the average of those forecasts. This is, however, only one of many possible mappings. The problem consisting in selecting this mapping has hitherto been overlooked in the literature. We propose to use the structure of the downstream decision-making problem to select a good mapping. For this purpose, we introduce a multilevel mathematical programming formulation that explicitly links the time series forecasts to the SND problem of interest. The solution is a periodic demand estimate that minimizes costs over the tactical planning horizon. We report results in an extensive empirical study of a large-scale application from the Canadian National Railway Company. They clearly show the importance of the periodic demand estimation problem. Indeed, the planning costs exhibit an important variation over different periodic demand estimates and using an estimate different from the mean forecast can lead to substantial cost reductions. Moreover, the costs associated with the periodic demand estimates based on forecasts were comparable to, or even better than those obtained using the mean of actual demand.
    Generalized Kernel Ridge Regression for Long Term Causal Inference: Treatment Effects, Dose Responses, and Counterfactual Distributions. (arXiv:2201.05139v1 [econ.EM])
    (2 min) I propose kernel ridge regression estimators for long term causal inference, where a short term experimental data set containing randomized treatment and short term surrogates is fused with a long term observational data set containing short term surrogates and long term outcomes. I propose estimators of treatment effects, dose responses, and counterfactual distributions with closed form solutions in terms of kernel matrix operations. I allow covariates, treatment, and surrogates to be discrete or continuous, and low, high, or infinite dimensional. For long term treatment effects, I prove $\sqrt{n}$ consistency, Gaussian approximation, and semiparametric efficiency. For long term dose responses, I prove uniform consistency with finite sample rates. For long term counterfactual distributions, I prove convergence in distribution.
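    For readers unfamiliar with the building block: generic kernel ridge regression has the closed-form solution alpha = (K + lambda * n * I)^{-1} y, with predictions f(x) = k(x, X) alpha. A plain NumPy sketch of that generic estimator (not the paper's long-term causal estimators; the RBF kernel and constants are illustrative):

      import numpy as np

      def krr_fit_predict(X, y, X_test, lam=1e-2, bandwidth=1.0):
          def rbf(A, B):                       # Gaussian (RBF) kernel matrix
              d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
              return np.exp(-d2 / (2 * bandwidth ** 2))
          n = len(X)
          alpha = np.linalg.solve(rbf(X, X) + lam * n * np.eye(n), y)
          return rbf(X_test, X) @ alpha

      X = np.random.rand(100, 1)
      y = np.sin(4 * X[:, 0]) + 0.1 * np.random.randn(100)
      print(krr_fit_predict(X, y, np.array([[0.5]])))   # approx. sin(2)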
    Implicit Bias of MSE Gradient Optimization in Underparameterized Neural Networks. (arXiv:2201.04738v1 [stat.ML])
    (2 min) We study the dynamics of a neural network in function space when optimizing the mean squared error via gradient flow. We show that in the underparameterized regime the network learns eigenfunctions of an integral operator $T_{K^\infty}$ determined by the Neural Tangent Kernel (NTK) at rates corresponding to their eigenvalues. For example, for uniformly distributed data on the sphere $S^{d - 1}$ and rotation invariant weight distributions, the eigenfunctions of $T_{K^\infty}$ are the spherical harmonics. Our results can be understood as describing a spectral bias in the underparameterized regime. The proofs use the concept of "Damped Deviations", where deviations of the NTK matter less for eigendirections with large eigenvalues due to the occurrence of a damping factor. Aside from the underparameterized regime, the damped deviations point-of-view can be used to track the dynamics of the empirical risk in the overparameterized setting, allowing us to extend certain results in the literature. We conclude that damped deviations offers a simple and unifying perspective of the dynamics when optimizing the squared error.
    Deep Learning-based Extreme Heatwave Forecast. (arXiv:2103.09743v3 [cs.LG] UPDATED)
    (2 min) Because of the impact of extreme heat waves and heat domes on society and biodiversity, their study is a key challenge. We specifically study long-lasting extreme heat waves, which are among the most important for climate impacts. Physics-driven weather forecast systems or climate models can be used to forecast their occurrence or predict their probability. The present work explores the use of deep learning architectures, trained using outputs of a climate model, as an alternative strategy to forecast the occurrence of extreme long-lasting heatwaves. This new approach will be useful for several key scientific goals, including the study of climate model statistics, the construction of a quantitative proxy for resampling rare events in climate models, and the study of the impact of climate change; it should eventually also be useful for forecasting. Fulfilling these important goals implies addressing issues such as the class-size imbalance intrinsically associated with rare-event prediction, and assessing the potential benefits of transfer learning to address the nested nature of extreme events (naturally included in less extreme ones). We train a Convolutional Neural Network, using 1000 years of climate model outputs, with large-class undersampling and transfer learning. From the observed snapshots of the surface temperature and the 500 hPa geopotential height fields, the trained network achieves significant performance in forecasting the occurrence of long-lasting extreme heatwaves. We are able to predict them at three different levels of intensity, and as early as 15 days ahead of the start of the event (30 days ahead of the end of the event).
    Symmetric Sparse Boolean Matrix Factorization and Applications. (arXiv:2102.01570v3 [cs.LG] UPDATED)
    (2 min) In this work, we study a variant of nonnegative matrix factorization where we wish to find a symmetric factorization of a given input matrix into a sparse, Boolean matrix. Formally speaking, given $\mathbf{M}\in\mathbb{Z}^{m\times m}$, we want to find $\mathbf{W}\in\{0,1\}^{m\times r}$ such that $\| \mathbf{M} - \mathbf{W}\mathbf{W}^\top \|_0$ is minimized among all $\mathbf{W}$ for which each row is $k$-sparse. This question turns out to be closely related to a number of questions like recovering a hypergraph from its line graph, as well as reconstruction attacks for private neural network training. As this problem is hard in the worst-case, we study a natural average-case variant that arises in the context of these reconstruction attacks: $\mathbf{M} = \mathbf{W}\mathbf{W}^{\top}$ for $\mathbf{W}$ a random Boolean matrix with $k$-sparse rows, and the goal is to recover $\mathbf{W}$ up to column permutation. Equivalently, this can be thought of as recovering a uniformly random $k$-uniform hypergraph from its line graph. Our main result is a polynomial-time algorithm for this problem based on bootstrapping higher-order information about $\mathbf{W}$ and then decomposing an appropriate tensor. The key ingredient in our analysis, which may be of independent interest, is to show that such a matrix $\mathbf{W}$ has full column rank with high probability as soon as $m = \widetilde{\Omega}(r)$, which we do using tools from Littlewood-Offord theory and estimates for binary Krawtchouk polynomials.
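    A small Python/NumPy sketch of the average-case setup: plant a random k-sparse Boolean W, observe M = W W^T, and evaluate the ||.||_0 objective; the rank check mirrors the full-column-rank lemma. The recovery algorithm itself (tensor decomposition) is not sketched here:

        import numpy as np

        rng = np.random.default_rng(1)
        m, r, k = 40, 10, 3
        W = np.zeros((m, r), dtype=int)
        for i in range(m):                               # k-sparse Boolean rows
            W[i, rng.choice(r, size=k, replace=False)] = 1
        M = W @ W.T                                      # the observed input

        def loss(W_hat):
            # Objective: ||M - W_hat W_hat^T||_0 over row-k-sparse Boolean W_hat.
            return np.count_nonzero(M - W_hat @ W_hat.T)

        print(loss(W))                                   # 0 at the planted factorization
        print(np.linalg.matrix_rank(W))                  # full column rank w.h.p. once m = Omega~(r)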
    Improved Multi-objective Data Stream Clustering with Time and Memory Optimization. (arXiv:2201.05079v1 [cs.LG])
    (2 min) The analysis of data streams has received considerable attention over the past few decades due to sensors, social media, etc. It aims to recognize patterns in an unordered, infinite, and evolving stream of observations. Clustering this type of data requires restrictions on time and memory. This paper introduces a new data stream clustering method (IMOC-Stream). Unlike other clustering algorithms, this method uses two different objective functions to capture different aspects of the data. The goals of IMOC-Stream are to: 1) reduce computation time by using idle times to apply genetic operations and enhance the solution; 2) reduce memory allocation by introducing a new tree synopsis; and 3) find arbitrarily shaped clusters by using a multi-objective framework. We conducted an experimental study with high-dimensional stream datasets and compared our method to well-known stream clustering techniques. The experiments show the ability of our method to partition the data stream into arbitrarily shaped, compact, and well-separated clusters while optimizing time and memory. Our method also outperformed most of the stream algorithms in terms of NMI and ARAND measures.
    Generalized Shape Metrics on Neural Representations. (arXiv:2110.14739v2 [stat.ML] UPDATED)
    (2 min) Understanding the operation of biological and artificial networks remains a difficult and important challenge. To identify general principles, researchers are increasingly interested in surveying large collections of networks that are trained on, or biologically adapted to, similar tasks. A standardized set of analysis tools is now needed to identify how network-level covariates -- such as architecture, anatomical brain region, and model organism -- impact neural representations (hidden layer activations). Here, we provide a rigorous foundation for these analyses by defining a broad family of metric spaces that quantify representational dissimilarity. Using this framework we modify existing representational similarity measures based on canonical correlation analysis to satisfy the triangle inequality, formulate a novel metric that respects the inductive biases in convolutional layers, and identify approximate Euclidean embeddings that enable network representations to be incorporated into essentially any off-the-shelf machine learning method. We demonstrate these methods on large-scale datasets from biology (Allen Institute Brain Observatory) and deep learning (NAS-Bench-101). In doing so, we identify relationships between neural representations that are interpretable in terms of anatomical features and model performance.
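    A hedged sketch of one member of such a metric family: an orthogonal-Procrustes distance between centered, scale-normalized representations, which is rotation-invariant and satisfies the triangle inequality. The paper's CCA-based metrics differ in the allowed alignment group; names and normalization here are illustrative:

        import numpy as np

        def shape_distance(X, Y):
            # X, Y: (n_stimuli x n_units) representation matrices.
            # d(X, Y) = min over orthogonal Q of ||X~ - Y~ Q||_F,
            # after centering and scale normalization.
            X = X - X.mean(0); X = X / np.linalg.norm(X)
            Y = Y - Y.mean(0); Y = Y / np.linalg.norm(Y)
            U, _, Vt = np.linalg.svd(Y.T @ X)
            Q = U @ Vt                       # optimal rotation (orthogonal Procrustes)
            return np.linalg.norm(X - Y @ Q)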
    Neural Koopman Lyapunov Control. (arXiv:2201.05098v1 [eess.SY])
    (2 min) Learning and synthesizing stabilizing controllers for unknown nonlinear systems is a challenging problem for real-world and industrial applications. Koopman operator theory allows one to analyze nonlinear systems through the lens of linear systems, and nonlinear control systems through the lens of bilinear control systems. The key idea of these methods lies in the transformation of the coordinates of the nonlinear system into the Koopman observables, which are coordinates that allow the representation of the original (control) system as a higher-dimensional linear (bilinear control) system. However, for nonlinear control systems, the bilinear control model obtained by applying Koopman operator based learning methods is not necessarily stabilizable; therefore, the existence of a stabilizing feedback control, which is crucial for many real-world applications, is not guaranteed. Simultaneous identification of stabilizable Koopman-based bilinear control systems and the associated Koopman observables is still an open problem. In this paper, we propose a framework to identify and construct these stabilizable bilinear models and their associated observables from data, by simultaneously learning a bilinear Koopman embedding for the underlying unknown nonlinear control system and a Control Lyapunov Function (CLF) for the Koopman-based bilinear model, using a learner and falsifier. Our proposed approach thereby provides provable guarantees of global asymptotic stability for nonlinear control systems with unknown dynamics. Numerical simulations are provided to validate the efficacy of our proposed class of stabilizing feedback controllers for unknown nonlinear systems.
    The curse of overparametrization in adversarial training: Precise analysis of robust generalization for random features regression. (arXiv:2201.05149v1 [cs.LG])
    (2 min) Successful deep learning models often involve training neural network architectures that contain more parameters than the number of training samples. Such overparametrized models have been extensively studied in recent years, and the virtues of overparametrization have been established from both the statistical perspective, via the double-descent phenomenon, and the computational perspective via the structural properties of the optimization landscape. Despite the remarkable success of deep learning architectures in the overparametrized regime, it is also well known that these models are highly vulnerable to small adversarial perturbations in their inputs. Even when adversarially trained, their performance on perturbed inputs (robust generalization) is considerably worse than their best attainable performance on benign inputs (standard generalization). It is thus imperative to understand how overparametrization fundamentally affects robustness. In this paper, we will provide a precise characterization of the role of overparametrization on robustness by focusing on random features regression models (two-layer neural networks with random first layer weights). We consider a regime where the sample size, the input dimension and the number of parameters grow in proportion to each other, and derive an asymptotically exact formula for the robust generalization error when the model is adversarially trained. Our developed theory reveals the nontrivial effect of overparametrization on robustness and indicates that for adversarially trained random features models, high overparametrization can hurt robust generalization.
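    For concreteness, a minimal Python/NumPy sketch of the (standard-training) random features regression model the analysis concerns: a two-layer network with random, frozen first-layer weights and a ridge-trained second layer. The adversarial training and the asymptotic formulas are beyond this sketch, and all sizes are illustrative:

        import numpy as np

        rng = np.random.default_rng(0)
        n, d, N = 200, 50, 400                    # samples, input dim, random features
        X = rng.normal(size=(n, d)) / np.sqrt(d)
        y = rng.normal(size=n)
        W = rng.normal(size=(d, N))               # random first-layer weights (frozen)
        Phi = np.maximum(X @ W, 0.0)              # ReLU random features
        lam = 1e-3
        a = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ y)  # ridge-trained second layer
        print(np.linalg.norm(Phi @ a - y))        # training residual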
    Quasi-Framelets: Another Improvement to Graph Neural Networks. (arXiv:2201.04728v1 [cs.LG])
    (2 min) This paper aims to provide a novel design of multiscale framelet convolutions for spectral graph neural networks. In the spectral paradigm, spectral GNNs improve graph learning task performance by proposing various spectral filters in the spectral domain to capture both global and local graph structure information. Although the existing spectral approaches show superior performance on some graphs, they suffer from a lack of flexibility and are fragile when graph information is incomplete or perturbed. Our new framelet convolution incorporates filtering functions directly designed in the spectral domain to overcome these limitations. The proposed convolution shows great flexibility in cutting off spectral information and effectively mitigates the negative effect of noisy graph signals. Besides, to exploit the heterogeneity in real-world graph data, the heterogeneous graph neural network with our new framelet convolution provides a solution for embedding the intrinsic topological information of meta-paths with multi-level graph analysis. Extensive experiments have been conducted on real-world heterogeneous and homogeneous graphs under settings with noisy node features, and superior performance results are achieved.
    Fish sounds: towards the evaluation of marine acoustic biodiversity through data-driven audio source separation. (arXiv:2201.05013v1 [cs.SD])
    (2 min) The marine ecosystem is changing at an alarming rate, exhibiting biodiversity loss and the migration of tropical species to temperate basins. Monitoring the underwater environments and their inhabitants is of fundamental importance to understand the evolution of these systems and implement safeguard policies. However, assessing and tracking biodiversity is often a complex task, especially in large and uncontrolled environments, such as the oceans. One of the most popular and effective methods for monitoring marine biodiversity is passive acoustic monitoring (PAM), which employs hydrophones to capture underwater sound. Many aquatic animals produce sounds characteristic of their own species; these signals travel efficiently underwater and can be detected even at great distances. Furthermore, modern technologies are becoming more and more convenient and precise, allowing for very accurate and careful data acquisition. To date, audio captured with PAM devices is frequently manually processed by marine biologists and interpreted with traditional signal processing techniques for the detection of animal vocalizations. This is a challenging task, as PAM recordings often span long periods of time. Moreover, one of the causes of biodiversity loss is sound pollution; in data obtained from regions with loud anthropic noise, it is hard to manually separate the artificial noise from the fish sounds. Nowadays, machine learning and, in particular, deep learning represents the state of the art for processing audio signals. Specifically, sound separation networks are able to identify and separate human voices and musical instruments. In this work, we show that the same techniques can be successfully used to automatically extract fish vocalizations in PAM recordings, opening up the possibility for biodiversity monitoring at a large scale.
    AequeVox: Automated Fairness Testing of Speech Recognition Systems. (arXiv:2110.09843v2 [cs.LG] UPDATED)
    (2 min) Automatic Speech Recognition (ASR) systems have become ubiquitous. They can be found in a variety of form factors and are increasingly important in our daily lives. As such, ensuring that these systems are equitable to different subgroups of the population is crucial. In this paper, we introduce AequeVox, an automated testing framework for evaluating the fairness of ASR systems. AequeVox simulates different environments to assess the effectiveness of ASR systems for different populations. In addition, we investigate whether the chosen simulations are comprehensible to humans. We further propose a fault localization technique capable of identifying words that are not robust to these varying environments. Both components of AequeVox are able to operate in the absence of ground truth data. We evaluated AequeVox on speech from four different datasets using three different commercial ASRs. Our experiments reveal that non-native English, female, and Nigerian English speakers generate 109%, 528.5%, and 156.9% more errors, on average, than native English, male, and UK Midlands speakers, respectively. Our user study also reveals that 82.9% of the simulations (employed through speech transformations) had a comprehensibility rating above seven (out of ten), with the lowest rating being 6.78. This further validates the fairness violations discovered by AequeVox. Finally, we show that the non-robust words, as predicted by the fault localization technique embodied in AequeVox, show 223.8% more errors than the predicted robust words across all ASRs.
    How Tight Can PAC-Bayes be in the Small Data Regime?. (arXiv:2106.03542v4 [stat.ML] UPDATED)
    (2 min) In this paper, we investigate the question: Given a small number of datapoints, for example N = 30, how tight can PAC-Bayes and test set bounds be made? For such small datasets, test set bounds adversely affect generalisation performance by withholding data from the training procedure. In this setting, PAC-Bayes bounds are especially attractive, due to their ability to use all the data to simultaneously learn a posterior and bound its generalisation risk. We focus on the case of i.i.d. data with a bounded loss and consider the generic PAC-Bayes theorem of Germain et al. While their theorem is known to recover many existing PAC-Bayes bounds, it is unclear what the tightest bound derivable from their framework is. For a fixed learning algorithm and dataset, we show that the tightest possible bound coincides with a bound considered by Catoni; and, in the more natural case of distributions over datasets, we establish a lower bound on the best bound achievable in expectation. Interestingly, this lower bound recovers the Chernoff test set bound if the posterior is equal to the prior. Moreover, to illustrate how tight these bounds can be, we study synthetic one-dimensional classification tasks in which it is feasible to meta-learn both the prior and the form of the bound to numerically optimise for the tightest bounds possible. We find that in this simple, controlled scenario, PAC-Bayes bounds are competitive with comparable, commonly used Chernoff test set bounds. However, the sharpest test set bounds still lead to better guarantees on the generalisation error than the PAC-Bayes bounds we consider.
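    For intuition, a minimal Python sketch of a Catoni-style PAC-Bayes bound for a loss in [0, 1], of the kind the tightest fixed-dataset bound coincides with; the exact constants, the grid over C (which formally requires a union bound or fixing C in advance), and all names are assumptions for illustration, not the paper's statement:

        import numpy as np

        def catoni_bound(emp_risk, kl, n, delta, C=1.0):
            # Catoni-style bound for a [0,1]-valued loss and a fixed C > 0:
            # R(rho) <= (1 - exp(-C*R_hat(rho) - (KL + ln(1/delta))/n)) / (1 - exp(-C)).
            inner = C * emp_risk + (kl + np.log(1.0 / delta)) / n
            return (1.0 - np.exp(-inner)) / (1.0 - np.exp(-C))

        # Small-data example in the spirit of the abstract (N = 30 datapoints);
        # optimizing C over a grid would need a union-bound correction in a proof.
        bounds = [catoni_bound(0.1, kl=5.0, n=30, delta=0.05, C=c)
                  for c in np.linspace(0.1, 10.0, 100)]
        print(min(bounds))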
    GradMax: Growing Neural Networks using Gradient Information. (arXiv:2201.05125v1 [cs.LG])
    (2 min) The architecture and the parameters of neural networks are often optimized independently, which requires costly retraining of the parameters whenever the architecture is modified. In this work we instead focus on growing the architecture without requiring costly retraining. We present a method that adds new neurons during training without impacting what is already learned, while improving the training dynamics. We achieve the latter by maximizing the gradients of the new weights and find the optimal initialization efficiently by means of the singular value decomposition (SVD). We call this technique Gradient Maximizing Growth (GradMax) and demonstrate its effectiveness in a variety of vision tasks and architectures.
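    A hedged sketch of the growth step as described in the abstract: the new neuron's outgoing weights are zero, so the learned function is unchanged, and its incoming direction is taken from a top singular vector so the new weights receive large gradients. The gradient matrix G and all shapes are placeholders, not the paper's exact recipe:

        import numpy as np

        def grow_neuron(W_in, W_out, G):
            # W_in: (fan_in x h) incoming weights, W_out: (h x fan_out) outgoing weights.
            # G: (fan_in x fan_out) matrix built from backpropagated gradient statistics.
            U, S, Vt = np.linalg.svd(G, full_matrices=False)
            w_in_new = U[:, 0]                 # direction maximizing the initial gradient norm
            W_in = np.hstack([W_in, w_in_new[:, None]])
            W_out = np.vstack([W_out, np.zeros((1, W_out.shape[1]))])  # outputs unchanged
            return W_in, W_out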
    Distributed Cooperative Multi-Agent Reinforcement Learning with Directed Coordination Graph. (arXiv:2201.04962v1 [cs.MA])
    (2 min) Existing distributed cooperative multi-agent reinforcement learning (MARL) frameworks usually assume undirected coordination and communication graphs while estimating a global reward via consensus algorithms for policy evaluation. Such frameworks may incur expensive communication costs and exhibit poor scalability due to the requirement of global consensus. In this work, we study MARL with directed coordination graphs, and propose a distributed RL algorithm where the local policy evaluations are based on local value functions. The local value function of each agent is obtained by local communication with its neighbors through a directed learning-induced communication graph, without using any consensus algorithm. A zeroth-order optimization (ZOO) approach based on parameter perturbation is employed to achieve gradient estimation. By comparing with existing ZOO-based RL algorithms, we show that our proposed distributed RL algorithm guarantees high scalability. A distributed resource allocation example is shown to illustrate the effectiveness of our algorithm.
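    A minimal sketch of the generic two-point zeroth-order (ZOO) gradient estimator that parameter-perturbation schemes of this kind rely on; the objective f (here standing in for a local value function) and all names are illustrative:

        import numpy as np

        def zoo_gradient(f, theta, sigma=0.1, n_dirs=20, rng=None):
            # Two-point ZOO estimate:
            # g ~ (1/n) sum_i (f(theta + sigma*u_i) - f(theta - sigma*u_i)) / (2*sigma) * u_i
            rng = rng or np.random.default_rng()
            g = np.zeros_like(theta)
            for _ in range(n_dirs):
                u = rng.normal(size=theta.shape)
                g += (f(theta + sigma * u) - f(theta - sigma * u)) / (2.0 * sigma) * u
            return g / n_dirs

        # Toy check on f(theta) = -||theta||^2, whose gradient is -2*theta.
        print(zoo_gradient(lambda t: -np.sum(t**2), np.ones(3)))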
    Continuous Herded Gibbs Sampling. (arXiv:2106.06430v2 [stat.ML] UPDATED)
    (2 min) Herding is a technique to sequentially generate deterministic samples from a probability distribution. In this work, we propose a continuous herded Gibbs sampler that combines kernel herding on continuous densities with the Gibbs sampling idea. Our algorithm allows for deterministically sampling from high-dimensional multivariate probability densities, without directly sampling from the joint density. Experiments with Gaussian mixture densities indicate that the L2 error decreases similarly to kernel herding, while the computation time is significantly lower, i.e., linear in the number of dimensions.
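    A toy sketch of plain kernel herding over a finite candidate set (a common practical simplification), the building block the proposed sampler combines with Gibbs-style coordinate updates; the kernel choice and names are assumptions:

        import numpy as np

        def kernel_herding(candidates, n_out, gamma=0.5):
            # Greedy herding against the empirical distribution of `candidates`:
            # x_{t+1} = argmax_x  E_p[k(x, X)] - (1/(t+1)) * sum_{s<=t} k(x, x_s).
            k = lambda a, b: np.exp(-gamma * ((a[:, None] - b[None]) ** 2).sum(-1))
            mean_emb = k(candidates, candidates).mean(1)   # estimated mean embedding
            chosen = []
            for t in range(n_out):
                scores = mean_emb.copy()
                if chosen:
                    scores -= k(candidates, np.array(chosen)).sum(1) / (t + 1)
                chosen.append(candidates[scores.argmax()])
            return np.array(chosen)

        # Example: herd 10 deterministic samples from a 2-D Gaussian mixture sample.
        rng = np.random.default_rng(0)
        pts = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
        print(kernel_herding(pts, 10))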
    Discrete-time Contraction-based Control of Nonlinear Systems with Parametric Uncertainties using Neural Networks. (arXiv:2105.05432v2 [eess.SY] UPDATED)
    (2 min) In response to continuously changing feedstock supply and market demand for products with different specifications, processes need to be operated at time-varying operating conditions and targets (e.g., setpoints) to improve the process economy, in contrast to traditional process operations around predetermined equilibria. In this paper, a contraction theory-based control approach using neural networks is developed for nonlinear chemical processes to achieve time-varying reference tracking. This approach leverages the universal approximation characteristics of neural networks with discrete-time contraction analysis and control. It involves training a neural network to learn a contraction metric and a differential feedback gain, which are embedded in a contraction-based controller. A second, separate neural network is also incorporated into the control loop to perform online learning of uncertain system model parameters. The resulting control scheme is capable of achieving efficient offset-free tracking of time-varying references under a full range of model uncertainty, without the need for controller structure redesign as the reference changes. This robust approach can deal with bounded parametric uncertainties in the process model, which are commonly encountered in industrial (chemical) processes, and it also ensures process stability during online simultaneous learning and control. Simulation examples are provided to illustrate the above approach.
    Planning in Observable POMDPs in Quasipolynomial Time. (arXiv:2201.04735v1 [cs.LG])
    (2 min) Partially Observable Markov Decision Processes (POMDPs) are a natural and general model in reinforcement learning that takes into account the agent's uncertainty about its current state. In the literature on POMDPs, it is customary to assume access to a planning oracle that computes an optimal policy when the parameters are known, even though the problem is known to be computationally hard. Almost all existing planning algorithms either run in exponential time, lack provable performance guarantees, or require placing strong assumptions on the transition dynamics under every possible policy. In this work, we revisit the planning problem and ask: are there natural and well-motivated assumptions that make planning easy? Our main result is a quasipolynomial-time algorithm for planning in (one-step) observable POMDPs. Specifically, we assume that well-separated distributions on states lead to well-separated distributions on observations, and thus the observations are at least somewhat informative in each step. Crucially, this assumption places no restrictions on the transition dynamics of the POMDP; nevertheless, it implies that near-optimal policies admit quasi-succinct descriptions, which is not true in general (under standard hardness assumptions). Our analysis is based on new quantitative bounds for filter stability -- i.e., the rate at which an optimal filter for the latent state forgets its initialization. Furthermore, we prove matching hardness for planning in observable POMDPs under the Exponential Time Hypothesis.
    Low-Dimensional Structure in the Space of Language Representations is Reflected in Brain Responses. (arXiv:2106.05426v4 [cs.CL] UPDATED)
    (0 min) How related are the representations learned by neural language models, translation models, and language tagging tasks? We answer this question by adapting an encoder-decoder transfer learning method from computer vision to investigate the structure among 100 different feature spaces extracted from hidden representations of various networks trained on language tasks. This method reveals a low-dimensional structure where language models and translation models smoothly interpolate between word embeddings, syntactic and semantic tasks, and future word embeddings. We call this low-dimensional structure a language representation embedding because it encodes the relationships between representations needed to process language for a variety of NLP tasks. We find that this representation embedding can predict how well each individual feature space maps to human brain responses to natural language stimuli recorded using fMRI. Additionally, we find that the principal dimension of this structure can be used to create a metric which highlights the brain's natural language processing hierarchy. This suggests that the embedding captures some part of the brain's natural language representation structure.
    Partial Recovery in the Graph Alignment Problem. (arXiv:2007.00533v5 [stat.ML] UPDATED)
    (0 min) In this paper, we consider the graph alignment problem, which is the problem of recovering, given two graphs, a one-to-one mapping between nodes that maximizes edge overlap. This problem can be viewed as a noisy version of the well-known graph isomorphism problem and appears in many applications, including social network deanonymization and cellular biology. Our focus here is on partial recovery, i.e., we look for a one-to-one mapping which is correct on a fraction of the nodes of the graph rather than on all of them, and we assume that the two input graphs to the problem are correlated Erd\H{o}s-R\'enyi graphs of parameters $(n,q,s)$. Our main contribution is then to give necessary and sufficient conditions on $(n,q,s)$ under which partial recovery is possible with high probability as the number of nodes $n$ goes to infinity. In particular, we show that it is possible to achieve partial recovery in the $nqs=\Theta(1)$ regime under certain additional assumptions.
    Flood Prediction and Analysis on the Relevance of Features using Explainable Artificial Intelligence. (arXiv:2201.05046v1 [cs.LG])
    (0 min) This paper presents flood prediction models for the state of Kerala in India, built by analyzing monthly rainfall data and applying machine learning algorithms including Logistic Regression, K-Nearest Neighbors, Decision Trees, Random Forests, and Support Vector Machine. Although these models predict the occurrence of a flood in a particular year with high accuracy, they do not explain the prediction decision quantitatively or qualitatively. This paper shows how the features that contributed to the prediction decision are learned, and extends the models with explainable artificial intelligence modules that expose their inner workings. The obtained results confirm the validity of the findings uncovered by the explainer modules, based on the historical monthly rainfall and flood data for Kerala.
    An Overview of Uncertainty Quantification Methods for Infinite Neural Networks. (arXiv:2201.04746v1 [cs.LG])
    (0 min) To better understand the theoretical behavior of large neural networks, several works have analyzed the case where a network's width tends to infinity. In this regime, the effect of random initialization and the process of training a neural network can be formally expressed with analytical tools like Gaussian processes and neural tangent kernels. In this paper, we review methods for quantifying uncertainty in such infinite-width neural networks and compare their relationship to Gaussian processes in the Bayesian inference framework. We make use of several equivalence results along the way to obtain exact closed-form solutions for predictive uncertainty.
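    For reference, a compact Python/NumPy sketch of the exact Gaussian process posterior that such infinite-width Bayesian analyses reduce to, with an RBF kernel standing in for the infinite-width network kernel; all parameter values are illustrative:

        import numpy as np

        def rbf(A, B, gamma=1.0):
            d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
            return np.exp(-gamma * d2)

        def gp_posterior(X, y, X_star, noise=1e-2):
            # Closed-form GP posterior mean and (diagonal) predictive variance.
            K = rbf(X, X) + noise * np.eye(len(X))
            L = np.linalg.cholesky(K)
            alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
            K_s = rbf(X_star, X)
            mean = K_s @ alpha
            v = np.linalg.solve(L, K_s.T)
            var = np.diag(rbf(X_star, X_star)) - (v**2).sum(0)
            return mean, var

        X = np.linspace(-3, 3, 20)[:, None]; y = np.sin(X[:, 0])
        mean, var = gp_posterior(X, y, np.linspace(-3, 3, 50)[:, None])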
    Class-Balanced Distillation for Long-Tailed Visual Recognition. (arXiv:2104.05279v2 [cs.CV] UPDATED)
    (0 min) Real-world imagery is often characterized by a significant imbalance of the number of images per class, leading to long-tailed distributions. An effective and simple approach to long-tailed visual recognition is to learn feature representations and a classifier separately, with instance and class-balanced sampling, respectively. In this work, we introduce a new framework, by making the key observation that a feature representation learned with instance sampling is far from optimal in a long-tailed setting. Our main contribution is a new training method, referred to as Class-Balanced Distillation (CBD), that leverages knowledge distillation to enhance feature representations. CBD allows the feature representation to evolve in the second training stage, guided by the teacher learned in the first stage. The second stage uses class-balanced sampling, in order to focus on under-represented classes. This framework can naturally accommodate the usage of multiple teachers, unlocking the information from an ensemble of models to enhance recognition capabilities. Our experiments show that the proposed technique consistently outperforms the state of the art on long-tailed recognition benchmarks such as ImageNet-LT, iNaturalist17 and iNaturalist18.
    On Stochastic Moving-Average Estimators for Non-Convex Optimization. (arXiv:2104.14840v3 [math.OC] UPDATED)
    (0 min) In this paper, we consider the widely used but not fully understood stochastic estimator based on moving average (SEMA), which only requires a general unbiased stochastic oracle. We demonstrate the power of SEMA on a range of stochastic non-convex optimization problems. In particular, we analyze various stochastic methods (existing or newly proposed) based on the variance recursion property of SEMA for three families of non-convex optimization, namely standard stochastic non-convex minimization, stochastic non-convex strongly-concave min-max optimization, and stochastic bilevel optimization. Our contributions include: (i) for standard stochastic non-convex minimization, we present a simple and intuitive proof of convergence for a family of Adam-style methods (including Adam, AMSGrad, AdaBound, etc.) with an increasing or large "momentum" parameter for the first-order moment, which gives an alternative yet more natural way to guarantee that Adam converges; (ii) for stochastic non-convex strongly-concave min-max optimization, we present a single-loop primal-dual stochastic momentum and adaptive method based on the moving average estimators and establish its oracle complexity of $O(1/\epsilon^4)$ without using a large mini-batch size, addressing a gap in the literature; (iii) for stochastic bilevel optimization, we present a single-loop stochastic method based on the moving average estimators and establish its oracle complexity of $\widetilde O(1/\epsilon^4)$ without computing the SVD of the Hessian matrix, improving state-of-the-art results. For all these problems, we also establish a variance diminishing result for the used stochastic gradient estimators.
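    A minimal sketch of the moving-average estimator itself, applied to a toy quadratic objective; beta, the learning rate, and the objective are illustrative, not the paper's analysis:

        import numpy as np

        def sema_update(z, grad, beta=0.1):
            # SEMA recursion: z_{t+1} = (1 - beta) * z_t + beta * g_t,
            # with g_t any unbiased stochastic gradient at the current iterate.
            return (1.0 - beta) * z + beta * grad

        # Toy usage on f(theta) = 0.5 * ||theta||^2 with noisy gradients.
        rng = np.random.default_rng(0)
        theta, z, lr = np.ones(5), np.zeros(5), 0.1
        for _ in range(200):
            g = theta + rng.normal(scale=0.5, size=5)   # unbiased stochastic gradient
            z = sema_update(z, g)
            theta -= lr * z
        print(np.linalg.norm(theta))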
    Hyperparameter Importance for Machine Learning Algorithms. (arXiv:2201.05132v1 [stat.ML])
    (0 min) Hyperparameters play an essential role in the fitting of supervised machine learning algorithms. However, it is computationally expensive to tune all the tunable hyperparameters simultaneously, especially for large data sets. In this paper, we give a definition of hyperparameter importance that can be estimated by subsampling procedures. According to this importance, hyperparameters can then be tuned on the entire data set more efficiently. We show theoretically that the proposed importance on subsets of data is consistent with the one on the population data under weak conditions. Numerical experiments show that the proposed importance is consistent and can save substantial computational resources.
    Fractal Gaussian Networks: A sparse random graph model based on Gaussian Multiplicative Chaos. (arXiv:2008.03038v2 [stat.ML] UPDATED)
    (0 min) We propose a novel stochastic network model, called Fractal Gaussian Network (FGN), that embodies well-defined and analytically tractable fractal structures. Such fractal structures have been empirically observed in diverse applications. FGNs interpolate continuously between the popular purely random geometric graphs (a.k.a. the Poisson Boolean network), and random graphs with increasingly fractal behavior. In fact, they form a parametric family of sparse random geometric graphs that are parametrized by a fractality parameter which governs the strength of the fractal structure. FGNs are driven by the latent spatial geometry of Gaussian Multiplicative Chaos (GMC), a canonical model of fractality in its own right. We asymptotically characterize the expected number of edges, triangles, cliques and hub-and-spoke motifs in FGNs, unveiling a distinct pattern in their scaling with the size parameter of the network. We then examine the natural question of detecting the presence of fractality and the problem of parameter estimation based on observed network data, in addition to fundamental properties of the FGN as a random graph model. We also explore fractality in community structures by unveiling a natural stochastic block model in the setting of FGNs. Finally, we substantiate our results with phenomenological analysis of the FGN in the context of available scientific literature for fractality in networks, including applications to real-world massive network data.
    Deploying clinical machine learning? Consider the following.... (arXiv:2109.06919v2 [cs.LG] UPDATED)
    (0 min) Despite the intense attention and considerable investment into clinical machine learning (CML) research, relatively few applications have been deployed at a large scale in a real-world clinical environment. While research is important in advancing the state-of-the-art, translation is equally important in bringing these techniques and technologies into a position to ultimately impact healthcare. We believe a lack of appreciation for several considerations is a major cause of this discrepancy between expectation and reality. To better characterize a holistic perspective among researchers and practitioners, we survey several practitioners with commercial experience in developing CML for clinical deployment. Using these insights, we identify several main categories of challenges in order to better design and develop clinical machine learning applications.
    Transferable Time-Series Forecasting under Causal Conditional Shift. (arXiv:2111.03422v2 [cs.LG] UPDATED)
    (0 min) This paper focuses on the problem of semi-supervised domain adaptation for time-series forecasting, which is underexplored in the literature despite being often encountered in practice. Existing methods on time-series domain adaptation mainly follow the paradigm designed for static data, which cannot handle the domain-specific complex conditional dependencies raised by data offset, time lags, and variant data distributions. In order to address these challenges, we analyze variational conditional dependencies in time-series data, find that the causal structures are usually stable among domains, and further propose the causal conditional shift assumption. Enlightened by this assumption, we consider the causal generation process for time-series data and propose an end-to-end model for semi-supervised domain adaptation on time-series forecasting. Our method can not only discover the Granger-causal structures among cross-domain data but also address the cross-domain time-series forecasting problem with accurate and interpretable predictions. We further theoretically analyze the superiority of the proposed method, where the generalization error on the target domain is bounded by the empirical risks and by the discrepancy between the causal structures from different domains. Experimental results on both synthetic and real data demonstrate the effectiveness of our method for semi-supervised domain adaptation on time-series forecasting.
    Rate Distortion Characteristic Modeling for Neural Image Compression. (arXiv:2106.12954v2 [eess.IV] UPDATED)
    (0 min) End-to-end optimized neural image compression (NIC) has obtained superior lossy compression performance recently. In this paper, we consider the problem of rate-distortion (R-D) characteristic analysis and modeling for NIC. We make efforts to formulate the essential mathematical functions to describe the R-D behavior of NIC using deep networks. Thus, arbitrary bit-rate points can be elegantly realized by leveraging such a model via a single trained network. We propose a plug-in module to learn the relationship between the target bit-rate and the binary representation for the latent variable of the auto-encoder. The proposed scheme resolves the problem of training distinct models to reach different points in the R-D space. Furthermore, we model the rate and distortion characteristics of NIC as functions of the coding parameter $\lambda$. Our experiments show our proposed method is easy to adopt and realizes state-of-the-art continuous bit-rate coding performance, which implies that our approach would benefit the practical deployment of NIC.
    REST: Debiased Social Recommendation via Reconstructing Exposure Strategies. (arXiv:2201.04952v1 [cs.LG])
    (0 min) The recommendation system, relying on historical observational data to model the complex relationships among users and items, has achieved great success in real-world applications. Selection bias is one of the most important issues of the existing observational data based approaches, and is actually caused by multiple types of unobserved exposure strategies (e.g., promotions and holiday effects). Though various methods have been proposed to address this problem, they mainly rely on implicit debiasing techniques rather than explicitly modeling the unobserved exposure strategies. By explicitly Reconstructing Exposure STrategies (REST in short), we formalize the recommendation problem as counterfactual reasoning and propose a debiased social recommendation method. In REST, we assume that the exposure of an item is controlled by the latent exposure strategies, the user, and the item. Based on the above generation process, we first provide the theoretical guarantee of our method via identification analysis. Second, we employ a variational auto-encoder to reconstruct the latent exposure strategies, with the help of the social networks and the items. Third, we devise a counterfactual reasoning based recommendation algorithm by leveraging the recovered exposure strategies. Experiments on four real-world datasets, including three published datasets and one private WeChat Official Account dataset, demonstrate significant improvements over several state-of-the-art methods.
    Unifying Epidemic Models with Mixtures. (arXiv:2201.04960v1 [stat.ML])
    (0 min) The COVID-19 pandemic has emphasized the need for a robust understanding of epidemic models. Current models of epidemics are classified as either mechanistic or non-mechanistic: mechanistic models make explicit assumptions on the dynamics of disease, whereas non-mechanistic models make assumptions on the form of observed time series. Here, we introduce a simple mixture-based model which bridges the two approaches while retaining benefits of both. The model represents time series of cases and fatalities as a mixture of Gaussian curves, providing a flexible function class to learn from data compared to traditional mechanistic models. Although the model is non-mechanistic, we show that it arises as the natural outcome of a stochastic process based on a networked SIR framework. This allows learned parameters to take on a more meaningful interpretation compared to similar non-mechanistic models, and we validate the interpretations using auxiliary mobility data collected during the COVID-19 pandemic. We provide a simple learning algorithm to identify model parameters and establish theoretical results which show the model can be efficiently learned from data. Empirically, we find the model to have low prediction error. The model is available live at covidpredictions.mit.edu. Ultimately, this allows us to systematically understand the impacts of interventions on COVID-19, which is critical in developing data-driven solutions to controlling epidemics.
    Adherence Forecasting for Guided Internet-Delivered Cognitive Behavioral Therapy: A Minimally Data-Sensitive Approach. (arXiv:2201.04967v1 [cs.LG])
    (0 min) Internet-delivered psychological treatments (IDPT) are seen as an effective and scalable pathway to improving the accessibility of mental healthcare. Within this context, treatment adherence is an especially relevant challenge to address due to the reduced interaction between healthcare professionals and patients, compared to more traditional interventions. In parallel, there is increasing regulation of the use of people's personal data, especially in the digital sphere. In such regulations, data minimization is often a core tenet, such as within the General Data Protection Regulation (GDPR). Consequently, this work proposes a deep-learning approach to perform automatic adherence forecasting, while only relying on minimally sensitive login/logout data. This approach was tested on a dataset containing 342 patients undergoing guided internet-delivered cognitive behavioral therapy (G-ICBT) treatment. The proposed Self-Attention Network achieved over 70% average balanced accuracy when only 1/3 of the treatment duration had elapsed. As such, this study demonstrates that automatic adherence forecasting for G-ICBT is achievable using only minimally sensitive data, thus facilitating the implementation of such tools within real-world IDPT platforms.
    Deep Learning on Multimodal Sensor Data at the Wireless Edge for Vehicular Network. (arXiv:2201.04712v1 [cs.LG])
    (0 min) Beam selection for millimeter-wave links in a vehicular scenario is a challenging problem, as an exhaustive search among all candidate beam pairs cannot be assuredly completed within short contact times. We solve this problem by expediting beam selection through multimodal data collected from sensors such as LiDAR, camera images, and GPS. We propose individual modality and distributed fusion-based deep learning (F-DL) architectures that can execute locally as well as at a mobile edge computing center (MEC), with a study on associated tradeoffs. We also formulate and solve an optimization problem that considers practical beam-searching, MEC processing, and sensor-to-MEC data delivery latency overheads for determining the output dimensions of the above F-DL architectures. Results from extensive evaluations conducted on publicly available synthetic and home-grown real-world datasets reveal 95% and 96% improvement in beam selection speed over classical RF-only beam sweeping, respectively. F-DL also outperforms the state-of-the-art techniques by 20-22% in predicting the top-10 best beam pairs.
    Using Computational Intelligence for solving the Ornstein-Zernike equation. (arXiv:2201.05089v1 [cond-mat.soft])
    (0 min) The main goal of this thesis is to provide an exploration of the use of computational intelligence techniques to study the numerical solution of the Ornstein-Zernike equation for simple liquids. In particular, a continuous model of the hard sphere fluid is studied. There are two main proposals in this contribution. First, the use of neural networks as a way to parametrize closure relation when solving the Ornstein-Zernike equation. It is explicitly shown that in the case of the hard sphere fluid, the neural network approach seems to reduce to the so-called Hypernetted Chain closure. For the second proposal, we explore the fact that if more physical information is incorporated into the theoretical formalism, a better estimate can be obtained with the use of evolutionary optimization techniques. When choosing the modified Verlet closure relation, and leaving a couple of free parameters to be adjusted, the results are as good as those obtained from molecular simulations. The thesis is then closed with a brief summary of the main findings and outlooks on different ways to improve the proposals presented here.
    A Method for Estimating the Entropy of Time Series Using Artificial Neural Networks. (arXiv:2107.08399v4 [cs.LG] UPDATED)
    (0 min) Measuring the predictability and complexity of time series using entropy is an essential tool for designing and controlling nonlinear systems. However, the existing methods have some drawbacks related to the strong dependence of entropy on the parameters of the methods. To overcome these difficulties, this study proposes a new method for estimating the entropy of a time series using the LogNNet neural network model. The LogNNet reservoir matrix is filled with time series elements according to our algorithm. The accuracy of the classification of images from the MNIST-10 database is considered as the entropy measure and denoted by NNetEn. The novelty of the entropy calculation is that the time series is involved in mixing the input information in the reservoir. Greater complexity in the time series leads to a higher classification accuracy and higher NNetEn values. We introduce a new time series characteristic called time series learning inertia that determines the learning rate of the neural network. The robustness and efficiency of the method are verified on chaotic, periodic, random, binary, and constant time series. The comparison of NNetEn with other methods of entropy estimation demonstrates that our method is more robust and accurate and can be widely used in practice.
    Data-Driven Modeling and Prediction of Non-Linearizable Dynamics via Spectral Submanifolds. (arXiv:2201.04976v1 [math.DS])
    (0 min) We develop a methodology to construct low-dimensional predictive models from data sets representing essentially nonlinear (or non-linearizable) dynamical systems with a hyperbolic linear part that are subject to external forcing with finitely many frequencies. Our data-driven, sparse, nonlinear models are obtained as extended normal forms of the reduced dynamics on low-dimensional, attracting spectral submanifolds (SSMs) of the dynamical system. We illustrate the power of data-driven SSM reduction on high-dimensional numerical data sets and experimental measurements involving beam oscillations, vortex shedding and sloshing in a water tank. We find that SSM reduction trained on unforced data also predicts nonlinear response accurately under additional external forcing.
    Sparse Spiking Gradient Descent. (arXiv:2105.08810v2 [cs.NE] UPDATED)
    (0 min) There is an increasing interest in emulating Spiking Neural Networks (SNNs) on neuromorphic computing devices due to their low energy consumption. Recent advances have allowed training SNNs to a point where they start to compete with traditional Artificial Neural Networks (ANNs) in terms of accuracy, while at the same time being energy efficient when run on neuromorphic hardware. However, the process of training SNNs is still based on dense tensor operations originally developed for ANNs which do not leverage the spatiotemporally sparse nature of SNNs. We present here the first sparse SNN backpropagation algorithm which achieves the same or better accuracy as current state-of-the-art methods while being significantly faster and more memory efficient. We show the effectiveness of our method on real datasets of varying complexity (Fashion-MNIST, Neuromorphic-MNIST and Spiking Heidelberg Digits), achieving a speedup in the backward pass of up to 150x while being 85% more memory efficient, without losing accuracy.
    A robust kernel machine regression towards biomarker selection in multi-omics datasets of osteoporosis for drug discovery. (arXiv:2201.05060v1 [stat.ML])
    (0 min) Many statistical machine learning approaches could ultimately highlight novel features of the etiology of complex diseases by analyzing multi-omics data. However, they are sensitive to some deviations in distribution when the observed samples are potentially contaminated with adversarial corrupted outliers (e.g., a fictional data distribution). Likewise, statistical advances lag in supporting comprehensive data-driven analyses of complex multi-omics data integration. We propose a novel non-linear M-estimator-based approach, "robust kernel machine regression (RobKMR)," to improve the robustness of statistical machine regression and the diversity of fictional data to examine the higher-order composite effect of multi-omics datasets. We address a robust kernel-centered Gram matrix to estimate the model parameters accurately. We also propose a robust score test to assess the marginal and joint Hadamard product of features from multi-omics data. We apply our proposed approach to a multi-omics dataset of osteoporosis (OP) from Caucasian females. Experiments demonstrate that the proposed approach effectively identifies the inter-related risk factors of OP. With solid evidence (p-value = 0.00001), biological validations, network-based analysis, causal inference, and drug repurposing, the three selected triplets ((DKK1, SMTN, DRGX), (MTND5, FASTKD2, CSMD3), (MTND5, COG3, CSMD3)) are significant biomarkers and relate directly to BMD. Overall, the top three selected genes (DKK1, MTND5, FASTKD2) and one gene (SIDT1 at p-value = 0.001) significantly bind with four drugs (Tacrolimus, Ibandronate, Alendronate, and Bazedoxifene) out of 30 candidates for drug repurposing in OP. Further, the proposed approach can be applied to any disease model where multi-omics datasets are available.
    Weakly Supervised Scene Text Detection using Deep Reinforcement Learning. (arXiv:2201.04866v1 [cs.CV])
    (0 min) The challenging field of scene text detection requires complex data annotation, which is time-consuming and expensive. Techniques, such as weak supervision, can reduce the amount of data needed. In this paper we propose a weak supervision method for scene text detection, which makes use of reinforcement learning (RL). The reward received by the RL agent is estimated by a neural network, instead of being inferred from ground-truth labels. First, we enhance an existing supervised RL approach to text detection with several training optimizations, allowing us to close the performance gap to regression-based algorithms. We then use our proposed system in weakly- and semi-supervised training on real-world data. Our results show that training in a weakly supervised setting is feasible. However, we find that using our model in a semi-supervised setting, e.g., when combining labeled synthetic data with unannotated real-world data, produces the best results.
    Safe Policies for Reinforcement Learning via Primal-Dual Methods. (arXiv:1911.09101v2 [eess.SY] UPDATED)
    (0 min) In this paper, we study the learning of safe policies in the setting of reinforcement learning problems. That is, we aim to control a Markov Decision Process (MDP) whose transition probabilities are unknown, but for which we have access to sample trajectories through experience. We define safety as the agent remaining in a desired safe set with high probability during the operation time. We therefore consider a constrained MDP where the constraints are probabilistic. Since there is no straightforward way to optimize the policy with respect to the probabilistic constraint in a reinforcement learning framework, we propose an ergodic relaxation of the problem. The advantages of the proposed relaxation are threefold. (i) The safety guarantees are maintained in the case of episodic tasks and they are kept up to a given time horizon for continuing tasks. (ii) The constrained optimization problem, despite its non-convexity, has an arbitrarily small duality gap if the parametrization of the policy is rich enough. (iii) The gradients of the Lagrangian associated with the safe-learning problem can be easily computed using standard policy gradient results and stochastic approximation tools. Leveraging these advantages, we establish that primal-dual algorithms are able to find policies that are safe and optimal. We test the proposed approach in a navigation task in a continuous domain. The numerical results show that our algorithm is capable of dynamically adapting the policy to the environment and the required safety levels.
    A Geometric Approach to $k$-means. (arXiv:2201.04822v1 [cs.LG])
    (0 min) $k$-means clustering is a fundamental problem in various disciplines. This problem is nonconvex, and standard algorithms are only guaranteed to find a local optimum. Leveraging the structure of local solutions characterized in [1], we propose a general algorithmic framework for escaping undesirable local solutions and recovering the global solution (or the ground truth). This framework consists of alternating between the following two steps iteratively: (i) detect mis-specified clusters in a local solution and (ii) improve the current local solution by non-local operations. We discuss implementation of these steps, and elucidate how the proposed framework unifies variants of $k$-means algorithm in literature from a geometric perspective. In addition, we introduce two natural extensions of the proposed framework, where the initial number of clusters is misspecified. We provide theoretical justification for our approach, which is corroborated with extensive experiments.
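    A hedged sketch of one plausible "detect and improve by non-local operations" iteration (merge the two closest centers, split the highest-cost cluster, then refine with Lloyd's algorithm); the detection rule here is illustrative, not the characterization from [1]:

        import numpy as np

        def nonlocal_improve(X, centers, iters=20, rng=None):
            # One non-local move followed by Lloyd refinement. Assumes the
            # highest-cost cluster is non-empty; edge cases are ignored in this sketch.
            rng = rng or np.random.default_rng()
            k = len(centers)
            labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
            costs = np.array([((X[labels == j] - centers[j]) ** 2).sum() for j in range(k)])
            worst = costs.argmax()                            # candidate mis-specified cluster
            d = ((centers[:, None] - centers[None]) ** 2).sum(-1) + 1e12 * np.eye(k)
            a, _ = np.unravel_index(d.argmin(), d.shape)      # drop one of the closest pair (merge)
            pts = X[labels == worst]
            centers = np.vstack([np.delete(centers, a, axis=0),
                                 pts[rng.choice(len(pts))]])  # split the worst cluster
            for _ in range(iters):                            # Lloyd refinement
                labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
                centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                                    else centers[j] for j in range(k)])
            return centers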
    The Recurrent Reinforcement Learning Crypto Agent. (arXiv:2201.04699v1 [cs.LG])
    (0 min) We demonstrate an application of online transfer learning as a digital assets trading agent. This agent makes use of a powerful feature space representation in the form of an echo state network, the output of which is made available to a direct, recurrent reinforcement learning agent. The agent learns to trade the XBTUSD (Bitcoin versus US dollars) perpetual swap derivatives contract on BitMEX. It learns to trade intraday on five minutely sampled data, avoids excessive over-trading, captures a funding profit and is also able to predict the direction of the market. Overall, our crypto agent realises a total return of 350%, net of transaction costs, over roughly five years, 71% of which is down to funding profit. The annualised information ratio that it achieves is 1.46.
    Local2Global: A distributed approach for scaling representation learning on graphs. (arXiv:2201.04729v1 [cs.LG])
    (0 min) We propose a decentralised "local2global" approach to graph representation learning, that one can use a priori to scale any embedding technique. Our local2global approach proceeds by first dividing the input graph into overlapping subgraphs (or "patches") and training local representations for each patch independently. In a second step, we combine the local representations into a globally consistent representation by estimating the set of rigid motions that best align the local representations using information from the patch overlaps, via group synchronization. A key distinguishing feature of local2global relative to existing work is that patches are trained independently without the need for the often costly parameter synchronization during distributed training. This allows local2global to scale to large-scale industrial applications, where the input graph may not even fit into memory and may be stored in a distributed manner. We apply local2global on data sets of different sizes and show that our approach achieves a good trade-off between scale and accuracy on edge reconstruction and semi-supervised classification. We also consider the downstream task of anomaly detection and show how one can use local2global to highlight anomalies in cybersecurity networks.
    Privacy Amplification by Subsampling in Time Domain. (arXiv:2201.04762v1 [cs.CR])
    (0 min) Aggregate time-series data like traffic flow and site occupancy repeatedly sample statistics from a population across time. Such data can be profoundly useful for understanding trends within a given population, but also pose a significant privacy risk, potentially revealing e.g., who spends time where. Producing a private version of a time-series satisfying the standard definition of Differential Privacy (DP) is challenging due to the large influence a single participant can have on the sequence: if an individual can contribute to each time step, the amount of additive noise needed to satisfy privacy increases linearly with the number of time steps sampled. As such, if a signal spans a long duration or is oversampled, an excessive amount of noise must be added, drowning out underlying trends. However, in many applications an individual realistically cannot participate at every time step. When this is the case, we observe that the influence of a single participant (sensitivity) can be reduced by subsampling and/or filtering in time, while still meeting privacy requirements. Using a novel analysis, we show this significant reduction in sensitivity and propose a corresponding class of privacy mechanisms. We demonstrate the utility benefits of these techniques empirically with real-world and synthetic time-series data.
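    A simplified illustration of the sensitivity argument with the Laplace mechanism: if any one individual can influence at most max_participations time steps (enforced by subsampling/filtering in time), the L1 sensitivity, and hence the noise scale, shrinks accordingly. This is a toy sketch, not the paper's mechanism class:

        import numpy as np

        def private_series(counts, eps, max_participations):
            # Without a participation bound, the L1 sensitivity of the whole series
            # is len(counts); bounding participations replaces it with a constant.
            sensitivity = max_participations
            scale = sensitivity / eps
            noise = np.random.default_rng().laplace(0.0, scale, size=len(counts))
            return counts + noise

        # Example: an occupancy series where each person appears in at most 4 steps.
        print(private_series(np.array([120., 95., 130., 88.]), eps=1.0, max_participations=4))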
    A Survey on Masked Facial Detection Methods and Datasets for Fighting Against COVID-19. (arXiv:2201.04777v1 [cs.CV])
    (0 min) Coronavirus disease 2019 (COVID-19) continues to pose a great challenge to the world since its outbreak. To fight against the disease, a series of artificial intelligence (AI) techniques have been developed and applied to real-world scenarios such as safety monitoring, disease diagnosis, infection risk assessment, and lesion segmentation of COVID-19 CT scans. The coronavirus epidemic has forced people to wear masks to counteract the transmission of the virus, which also makes it difficult to monitor large groups of people wearing masks. In this paper, we primarily focus on the AI techniques of masked facial detection and related datasets. We survey the recent advances, beginning with descriptions of masked facial detection datasets. Thirteen available datasets are described and discussed in detail. Then, the methods are roughly categorized into two classes: conventional methods and neural network-based methods. Conventional methods, which account for a small proportion, are usually trained by boosting algorithms with hand-crafted features. Neural network-based methods are further classified into three parts according to the number of processing stages. Representative algorithms are described in detail, coupled with some typical techniques that are described briefly. Finally, we summarize the recent benchmarking results, discuss the limitations of the datasets and methods, and outline future research directions. To our knowledge, this is the first survey about masked facial detection methods and datasets. Hopefully our survey could provide some help in the fight against epidemics.
    Self-semantic contour adaptation for cross modality brain tumor segmentation. (arXiv:2201.05022v1 [cs.CV])
    (0 min) Unsupervised domain adaptation (UDA) between two significantly disparate domains to learn high-level semantic alignment is a crucial yet challenging task. To this end, in this work, we propose exploiting low-level edge information to facilitate the adaptation as a precursor task, which has a small cross-domain gap, compared with semantic segmentation. The precise contour then provides spatial information to guide the semantic adaptation. More specifically, we propose a multi-task framework to learn a contouring adaptation network along with a semantic segmentation adaptation network, which takes both magnetic resonance imaging (MRI) slice and its initial edge map as input. These two networks are jointly trained with source domain labels, and the feature and edge map level adversarial learning is carried out for cross-domain alignment. In addition, self-entropy minimization is incorporated to further enhance segmentation performance. We evaluated our framework on the BraTS2018 database for cross-modality segmentation of brain tumors, showing the validity and superiority of our approach, compared with competing methods.
    CLSRIL-23: Cross Lingual Speech Representations for Indic Languages. (arXiv:2107.07402v2 [cs.CL] UPDATED)
    (0 min) We present CLSRIL-23, a self-supervised learning based pre-trained audio model which learns cross-lingual speech representations from raw audio across 23 Indic languages. It is built on top of wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations and jointly learns the quantization of latents shared across all languages. We compare the language-wise loss during pretraining to study the effects of monolingual and multilingual pretraining. Performance on some downstream fine-tuning tasks for speech recognition is also compared, and our experiments show that multilingual pretraining outperforms monolingual training, both in terms of learning speech representations which encode the phonetic similarity of languages and in terms of performance on downstream tasks. A decrease of 5% is observed in WER and 9.5% in CER when a multilingual pretrained model is used for fine-tuning in Hindi. All code and models are also open sourced. CLSRIL-23 is a model trained on $23$ languages and almost 10,000 hours of audio data to facilitate research in speech recognition for Indic languages. We hope that new state-of-the-art systems will be created using the self-supervised approach, especially for low-resource Indic languages.
    Metric-Free Individual Fairness in Online Learning. (arXiv:2002.05474v5 [cs.LG] UPDATED)
    (0 min) We study an online learning problem subject to the constraint of individual fairness, which requires that similar individuals are treated similarly. Unlike prior work on individual fairness, we do not assume the similarity measure among individuals is known, nor do we assume that such measure takes a certain parametric form. Instead, we leverage the existence of an auditor who detects fairness violations without enunciating the quantitative measure. In each round, the auditor examines the learner's decisions and attempts to identify a pair of individuals that are treated unfairly by the learner. We provide a general reduction framework that reduces online classification in our model to standard online classification, which allows us to leverage existing online learning algorithms to achieve sub-linear regret and number of fairness violations. Surprisingly, in the stochastic setting where the data are drawn independently from a distribution, we are also able to establish PAC-style fairness and accuracy generalization guarantees (Yona and Rothblum [2018]), despite only having access to a very restricted form of fairness feedback. Our fairness generalization bound qualitatively matches the uniform convergence bound of Yona and Rothblum [2018], while also providing a meaningful accuracy generalization guarantee. Our results resolve an open question by Gillen et al. [2018] by showing that online learning under an unknown individual fairness constraint is possible even without assuming a strong parametric form of the underlying similarity measure.
    Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? (arXiv:2201.05119v1 [cs.CV])
    (2 min) Despite recent progress made by self-supervised methods in representation learning with residual networks, they still underperform supervised learning on the ImageNet classification benchmark, limiting their applicability in performance-critical settings. Building on prior theoretical insights from Mitrovic et al. (2021), we propose ReLICv2, which combines an explicit invariance loss with a contrastive objective over a varied set of appropriately constructed data views. ReLICv2 achieves 77.1% top-1 classification accuracy on ImageNet using linear evaluation with a ResNet50 architecture and 80.6% with larger ResNet models, outperforming previous state-of-the-art self-supervised approaches by a wide margin. Most notably, ReLICv2 is the first representation learning method to consistently outperform the supervised baseline in a like-for-like comparison using a range of standard ResNet architectures. Finally, we show that despite using ResNet encoders, ReLICv2 is comparable to state-of-the-art self-supervised vision transformers.
    Adversarially Robust Classification by Conditional Generative Model Inversion. (arXiv:2201.04733v1 [cs.LG])
    (2 min) Most adversarial attack defense methods rely on obfuscating gradients. These methods are successful in defending against gradient-based attacks; however, they are easily circumvented either by attacks which do not use the gradient or by attacks which approximate and use the corrected gradient. Defenses that do not obfuscate gradients, such as adversarial training, exist, but these approaches generally make assumptions about the attack, such as its magnitude. We propose a classification model that does not obfuscate gradients and is robust by construction without assuming prior knowledge about the attack. Our method casts classification as an optimization problem where we "invert" a conditional generator trained on unperturbed, natural images to find the class that generates the closest sample to the query image. We hypothesize that a potential source of brittleness against adversarial attacks is the high-to-low-dimensional nature of feed-forward classifiers, which allows an adversary to find small perturbations in the input space that lead to large changes in the output space. On the other hand, a generative model is typically a low-to-high-dimensional mapping. While the method is related to Defense-GAN, the use of a conditional generative model and inversion in our model instead of the feed-forward classifier is a critical difference. Unlike Defense-GAN, which was shown to generate obfuscated gradients that are easily circumvented, we show that our method does not obfuscate gradients. We demonstrate that our model is extremely robust against black-box attacks and has improved robustness against white-box attacks compared to naturally trained, feed-forward classifiers.
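    A schematic of the inversion-based classifier (our sketch, not the authors' code; G, z_dim, and the optimizer settings are placeholders):

        import torch

        def classify_by_inversion(G, x, num_classes, z_dim=64, steps=200, lr=0.05):
            # G(z, y) -> image; predict the label whose best reconstruction
            # of the query x has the lowest squared error.
            errors = []
            for y in range(num_classes):
                z = torch.zeros(1, z_dim, requires_grad=True)
                opt = torch.optim.Adam([z], lr=lr)
                label = torch.tensor([y])
                for _ in range(steps):
                    opt.zero_grad()
                    loss = ((G(z, label) - x) ** 2).mean()
                    loss.backward()
                    opt.step()
                errors.append(loss.item())       # final error for this class
            return int(torch.tensor(errors).argmin())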
    Multi-Scale Adaptive Graph Neural Network for Multivariate Time Series Forecasting. (arXiv:2201.04828v1 [cs.LG])
    (2 min) Multivariate time series (MTS) forecasting plays an important role in the automation and optimization of intelligent applications. It is a challenging task, as we need to consider both complex intra-variable dependencies and inter-variable dependencies. Existing works only learn temporal patterns with the help of a single set of inter-variable dependencies. However, there are multi-scale temporal patterns in many real-world MTS. A single set of inter-variable dependencies makes the model prefer to learn one type of prominent and shared temporal pattern. In this paper, we propose a multi-scale adaptive graph neural network (MAGNN) to address the above issue. MAGNN exploits a multi-scale pyramid network to preserve the underlying temporal dependencies at different time scales. Since the inter-variable dependencies may be different under distinct time scales, an adaptive graph learning module is designed to infer the scale-specific inter-variable dependencies without pre-defined priors. Given the multi-scale feature representations and scale-specific inter-variable dependencies, a multi-scale temporal graph neural network is introduced to jointly model intra-variable dependencies and inter-variable dependencies. After that, we develop a scale-wise fusion module to effectively promote the collaboration across different time scales, and automatically capture the importance of contributed temporal patterns. Experiments on four real-world datasets demonstrate that MAGNN outperforms the state-of-the-art methods across various settings.
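    The abstract does not spell the module out; one common way to infer a scale-specific adjacency matrix without pre-defined priors is to learn node embeddings per scale and normalize their pairwise affinities (a hedged sketch, not MAGNN's exact design):

        import torch
        import torch.nn as nn

        class AdaptiveGraph(nn.Module):
            # Learns one adjacency matrix per time scale from scratch.
            def __init__(self, num_nodes, dim, num_scales):
                super().__init__()
                self.src = nn.Parameter(torch.randn(num_scales, num_nodes, dim))
                self.dst = nn.Parameter(torch.randn(num_scales, num_nodes, dim))

            def forward(self, scale):
                # Pairwise affinities between learnable node embeddings,
                # sparsified by ReLU and row-normalized by softmax.
                logits = torch.relu(self.src[scale] @ self.dst[scale].T)
                return torch.softmax(logits, dim=-1)

        A = AdaptiveGraph(num_nodes=20, dim=16, num_scales=3)(scale=0)  # 20x20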
    Toddler-Guidance Learning: Impacts of Critical Period on Multimodal AI Agents. (arXiv:2201.04990v1 [cs.LG])
    (2 min) Critical periods are phases during which a toddler's brain develops in spurts. To promote children's cognitive development, proper guidance is critical during this stage. However, it is not clear whether such a critical period also exists for the training of AI agents. Similar to human toddlers, well-timed guidance and multimodal interactions might significantly enhance the training efficiency of AI agents as well. To validate this hypothesis, we adapt the notion of critical periods to learning in AI agents and investigate the critical period in a virtual environment for AI agents. We formalize the critical period and toddler-guidance learning in the reinforcement learning (RL) framework. Then, we build a toddler-like environment with the VECA toolkit to mimic human toddlers' learning characteristics. We study three discrete levels of mutual interaction: weak mentor guidance (sparse reward), moderate mentor guidance (helper-reward), and mentor demonstration (behavioral cloning). We also introduce the EAVE dataset, consisting of 30,000 real-world images, to fully reflect the toddler's viewpoint. We evaluate the impact of critical periods on AI agents from two perspectives: how and when they are guided best in both uni- and multimodal learning. Our experimental results show that both uni- and multimodal agents with moderate mentor guidance and a critical period at 1 million and 2 million training steps show a noticeable improvement. We validate these results with transfer learning on the EAVE dataset, finding the same performance gains at the same critical period and guidance level.
    Solving Dynamic Graph Problems with Multi-Attention Deep Reinforcement Learning. (arXiv:2201.04895v1 [cs.LG])
    (2 min) Graph problems such as the traveling salesman problem or finding minimal Steiner trees are widely studied and used in data engineering and computer science. Typically, in real-world applications, the features of the graph tend to change over time; thus, finding a solution to the problem becomes challenging. The dynamic versions of many graph problems are key to a plethora of real-world problems in transportation, telecommunication, and social networks. In recent years, using deep learning techniques to find heuristic solutions for NP-hard graph combinatorial problems has gained much interest, as these learned heuristics can find near-optimal solutions efficiently. However, most of the existing methods for learning heuristics focus on static graph problems. The dynamic nature makes NP-hard graph problems much more challenging to learn, and existing methods fail to find reasonable solutions. In this paper, we propose a novel architecture named Graph Temporal Attention with Reinforcement Learning (GTA-RL) to learn heuristic solutions for graph-based dynamic combinatorial optimization problems. The GTA-RL architecture consists of an encoder capable of embedding temporal features of a combinatorial problem instance and a decoder capable of dynamically focusing on the embedded features to find a solution to a given combinatorial problem instance. We then extend our architecture to learn heuristics for the real-time version of combinatorial optimization problems, where the input features of a problem are not known a priori but rather learned in real-time. Our experimental results against several state-of-the-art learning-based algorithms and optimal solvers demonstrate that our approach outperforms state-of-the-art learning-based approaches in terms of effectiveness, and optimal solvers in terms of efficiency, on dynamic and real-time graph combinatorial optimization.
    Automated Reinforcement Learning: An Overview. (arXiv:2201.05000v1 [cs.LG])
    (2 min) Reinforcement Learning and, more recently, Deep Reinforcement Learning are popular methods for solving sequential decision-making problems modeled as Markov Decision Processes. Modeling a problem as an RL task and selecting algorithms and hyper-parameters require careful consideration, as different configurations may lead to completely different performance. These considerations are mainly the task of RL experts; however, RL is progressively becoming popular in other fields where the researchers and system designers are not RL experts. Besides, many modeling decisions, such as defining the state and action space, the size of batches and frequency of batch updating, and the number of timesteps, are typically made manually. For these reasons, automating different components of the RL framework is of great importance, and it has attracted much attention in recent years. Automated RL provides a framework in which different components of RL, including MDP modeling, algorithm selection, and hyper-parameter optimization, are modeled and defined automatically. In this article, we explore the literature and present recent work that can be used in automated RL. Moreover, we discuss the challenges, open questions, and research directions in AutoRL.
    Automatic Sparse Connectivity Learning for Neural Networks. (arXiv:2201.05020v1 [cs.CV])
    (2 min) Since sparse neural networks usually contain many zero weights, these unnecessary network connections can potentially be eliminated without degrading network performance. Therefore, well-designed sparse neural networks have the potential to significantly reduce FLOPs and computational resources. In this work, we propose a new automatic pruning method - Sparse Connectivity Learning (SCL). Specifically, a weight is re-parameterized as an element-wise multiplication of a trainable weight variable and a binary mask. Thus, network connectivity is fully described by the binary mask, which is modulated by a unit step function. We theoretically prove the fundamental principle of using a straight-through estimator (STE) for network pruning: the proxy gradients of the STE should be positive, ensuring that mask variables converge at their minima. After finding that Leaky ReLU, Softplus, and Identity STEs can satisfy this principle, we propose to adopt the Identity STE in SCL for discrete mask relaxation. We find that the mask gradients of different features are very unbalanced; hence, we propose to normalize the mask gradients of each feature to optimize mask variable training. In order to automatically train sparse masks, we include the total number of network connections as a regularization term in our objective function. As SCL does not require pruning criteria or hyper-parameters defined by designers for network layers, the network is explored in a larger hypothesis space to achieve optimized sparse connectivity for the best performance. SCL overcomes the limitations of existing automatic pruning methods. Experimental results demonstrate that SCL can automatically learn and select important network connections for various baseline network structures. Deep learning models trained by SCL outperform the SOTA human-designed and automatic pruning methods in sparsity, accuracy, and FLOPs reduction.
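    A condensed sketch of the masking mechanism described above (a hypothetical module name; the forward pass uses the binary mask from a unit step, while the Identity STE passes gradients straight through to the mask variable):

        import torch
        import torch.nn as nn

        class SparseLinear(nn.Module):
            def __init__(self, d_in, d_out):
                super().__init__()
                self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.1)
                self.mask_var = nn.Parameter(torch.zeros(d_out, d_in))

            def _mask(self):
                hard = (self.mask_var >= 0).float()   # unit step -> binary mask
                # Identity STE: binary in the forward pass, identity gradient
                # to mask_var in the backward pass.
                return hard + self.mask_var - self.mask_var.detach()

            def forward(self, x):
                return x @ (self.weight * self._mask()).T

            def connection_penalty(self):
                # Total number of active connections; added to the task loss
                # as the sparsity regularization term.
                return self._mask().sum()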
    Multi-Stage Hybrid Federated Learning over Large-Scale D2D-Enabled Fog Networks. (arXiv:2007.09511v5 [cs.NI] UPDATED)
    (3 min) Federated learning has generated significant interest, with nearly all works focused on a "star" topology where nodes/devices are each connected to a central server. We migrate away from this architecture and extend it through the network dimension to the case where there are multiple layers of nodes between the end devices and the server. Specifically, we develop multi-stage hybrid federated learning (MH-FL), a hybrid of intra- and inter-layer model learning that considers the network as a multi-layer cluster-based structure. MH-FL considers the topology structures among the nodes in the clusters, including local networks formed via device-to-device (D2D) communications, and presumes a semi-decentralized architecture for federated learning. It orchestrates the devices at different network layers in a collaborative/cooperative manner (i.e., using D2D interactions) to form local consensus on the model parameters and combines it with multi-stage parameter relaying between layers of the tree-shaped hierarchy. We derive the upper bound of convergence for MH-FL with respect to parameters of the network topology (e.g., the spectral radius) and the learning algorithm (e.g., the number of D2D rounds in different clusters). We obtain a set of policies for the D2D rounds at different clusters to guarantee either a finite optimality gap or convergence to the global optimum. We then develop a distributed control algorithm for MH-FL to tune the D2D rounds in each cluster over time to meet specific convergence criteria. Our experiments on real-world datasets verify our analytical results and demonstrate the advantages of MH-FL in terms of resource utilization metrics.
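    The intra-cluster consensus step can be pictured with a plain distributed-averaging sketch (ours, not the paper's algorithm): each device repeatedly mixes its parameters with its D2D neighbours' via a doubly stochastic matrix, approaching the cluster average at a speed tied to the spectrum of the mixing matrix.

        import numpy as np

        def d2d_consensus(params, W, rounds):
            # params: (n_devices, dim) local models; W: doubly stochastic
            # mixing matrix respecting the cluster's D2D topology.
            x = params.copy()
            for _ in range(rounds):
                x = W @ x          # each row mixes with its neighbours
            return x

        # Ring of 4 devices with uniform self/neighbour weights.
        W = np.array([[.50, .25, .00, .25],
                      [.25, .50, .25, .00],
                      [.00, .25, .50, .25],
                      [.25, .00, .25, .50]])
        models = np.random.randn(4, 3)
        out = d2d_consensus(models, W, rounds=50)
        # rows of 'out' converge towards models.mean(axis=0)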
    Spatiotemporal Clustering with Neyman-Scott Processes via Connections to Bayesian Nonparametric Mixture Models. (arXiv:2201.05044v1 [stat.ML])
    (2 min) Neyman-Scott processes (NSPs) are point process models that generate clusters of points in time or space. They are natural models for a wide range of phenomena, ranging from neural spike trains to document streams. The clustering property is achieved via a doubly stochastic formulation: first, a set of latent events is drawn from a Poisson process; then, each latent event generates a set of observed data points according to another Poisson process. This construction is similar to Bayesian nonparametric mixture models like the Dirichlet process mixture model (DPMM) in that the number of latent events (i.e. clusters) is a random variable, but the point process formulation makes the NSP especially well suited to modeling spatiotemporal data. While many specialized algorithms have been developed for DPMMs, comparatively fewer works have focused on inference in NSPs. Here, we present novel connections between NSPs and DPMMs, with the key link being a third class of Bayesian mixture models called mixture of finite mixture models (MFMMs). Leveraging this connection, we adapt the standard collapsed Gibbs sampling algorithm for DPMMs to enable scalable Bayesian inference on NSP models. We demonstrate the potential of Neyman-Scott processes on a variety of applications including sequence detection in neural spike trains and event detection in document streams.
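    The doubly stochastic construction is straightforward to simulate; a toy 1-D sketch with hypothetical rates:

        import numpy as np

        rng = np.random.default_rng(1)
        T, lam, alpha, sigma = 100.0, 0.1, 20, 0.5   # hypothetical parameters

        # Step 1: latent events ("cluster centres") from a homogeneous
        # Poisson process with rate lam on the interval [0, T].
        latents = rng.uniform(0, T, size=rng.poisson(lam * T))

        # Step 2: each latent event emits a Poisson(alpha) number of observed
        # points, scattered around it (Gaussian offsets here).
        obs = np.concatenate(
            [c + sigma * rng.normal(size=rng.poisson(alpha)) for c in latents]
        ) if len(latents) else np.empty(0)
        # The number of clusters in 'obs' is itself random, which is the
        # property linking NSPs to DPMM/MFMM-style mixture models.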
    SeamlessGAN: Self-Supervised Synthesis of Tileable Texture Maps. (arXiv:2201.05120v1 [cs.CV])
    (2 min) We present SeamlessGAN, a method capable of automatically generating tileable texture maps from a single input exemplar. In contrast to most existing methods, which focus solely on the synthesis problem, our work tackles both problems, synthesis and tileability, simultaneously. Our key insight is that tiling a latent space within a generative network trained using adversarial expansion techniques produces outputs with continuity at the seam intersection, which can then be turned into tileable images by cropping the central area. Since not every value of the latent space is valid to produce high-quality outputs, we leverage the discriminator as a perceptual error metric capable of identifying artifact-free textures during a sampling process. Further, in contrast to previous work on deep texture synthesis, our model is designed and optimized to work with multi-layered texture representations, enabling textures composed of multiple maps such as albedo, normals, etc. We extensively test our design choices for the network architecture, loss function and sampling parameters. We show qualitatively and quantitatively that our approach outperforms previous methods and works for textures of different types.
    Evaluation of Four Black-box Adversarial Attacks and Some Query-efficient Improvement Analysis. (arXiv:2201.05001v1 [cs.CR])
    (2 min) With the fast development of machine learning technologies, deep learning models have been deployed in almost every aspect of everyday life. However, the privacy and security of these models are threatened by adversarial attacks, among which the black-box attack is closer to reality, as only limited knowledge can be acquired from the model. In this paper, we provide basic background knowledge about adversarial attacks and comprehensively analyze four black-box attack algorithms: Bandits, NES, Square Attack, and ZOsignSGD. We also explore the newly proposed Square Attack method with respect to square size, hoping to improve its query efficiency.
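    As a concrete reference point, the gradient estimator underlying NES-style black-box attacks can be written in a few lines (hypothetical loss oracle f; antithetic sampling as commonly used):

        import numpy as np

        def nes_gradient(f, x, sigma=0.01, n=50, rng=None):
            # Estimate grad f(x) from 2n loss queries only (no gradients).
            rng = rng or np.random.default_rng(0)
            g = np.zeros_like(x)
            for _ in range(n):
                u = rng.normal(size=x.shape)
                # Antithetic pair: one query in the +u direction, one in -u.
                g += (f(x + sigma * u) - f(x - sigma * u)) * u
            return g / (2 * sigma * n)

        # Sanity check on a quadratic: the estimate approaches 2*x.
        x = np.array([1.0, -2.0])
        print(nes_gradient(lambda v: (v ** 2).sum(), x, n=2000))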
    Multi-View Non-negative Matrix Factorization Discriminant Learning via Cross Entropy Loss. (arXiv:2201.04726v1 [cs.LG])
    (2 min) Multi-view learning accomplishes classification objectives by leveraging the relationships between different views of the same object. Most existing methods focus on consistency and complementarity between multiple views, but not all of this information is useful for classification tasks. Instead, it is the specific discriminative information that plays an important role. Zhong Zhang et al. explore the discriminative and non-discriminative information existing in common and view-specific parts among different views via joint non-negative matrix factorization. In this paper, we improve this algorithm by using the cross-entropy loss function to better constrain the objective function. Finally, we achieve better classification performance than the original method on the same data sets and show its superiority over many state-of-the-art algorithms.
    Stock Movement Prediction Based on Bi-typed and Hybrid-relational Market Knowledge Graph via Dual Attention Networks. (arXiv:2201.04965v1 [q-fin.ST])
    (2 min) Stock Movement Prediction (SMP) aims at predicting the future price trends of listed companies' stocks, which is a challenging task due to the volatile nature of financial markets. Recent financial studies show that the momentum spillover effect plays a significant role in stock fluctuation. However, previous studies typically only learn the simple connection information among related companies, which inevitably fails to model the complex relations of listed companies in the real financial market. To address this issue, we first construct a more comprehensive Market Knowledge Graph (MKG) which contains bi-typed entities, including listed companies and their associated executives, and hybrid relations, including explicit relations and implicit relations. Afterward, we propose DanSmp, a novel Dual Attention Network, to learn the momentum spillover signals based upon the constructed MKG for stock prediction. The empirical experiments on our constructed datasets against nine SOTA baselines demonstrate that the proposed DanSmp is capable of improving stock prediction with the constructed MKG.
    Non-Stationary Representation Learning in Sequential Linear Bandits. (arXiv:2201.04805v1 [cs.LG])
    (2 min) In this paper, we study representation learning for multi-task decision-making in non-stationary environments. We consider the framework of sequential linear bandits, where the agent performs a series of tasks drawn from distinct sets associated with different environments. The embeddings of tasks in each set share a low-dimensional feature extractor called representation, and representations are different across sets. We propose an online algorithm that facilitates efficient decision-making by learning and transferring non-stationary representations in an adaptive fashion. We prove that our algorithm significantly outperforms the existing ones that treat tasks independently. We also conduct experiments using both synthetic and real data to validate our theoretical insights and demonstrate the efficacy of our algorithm.
    How Can Graph Neural Networks Help Document Retrieval: A Case Study on CORD19 with Concept Map Generation. (arXiv:2201.04672v1 [cs.IR])
    (2 min) Graph neural networks (GNNs), as a group of powerful tools for representation learning on irregular data, have manifested superiority in various downstream tasks. With unstructured texts represented as concept maps, GNNs can be exploited for tasks like document retrieval. Intrigued by how GNNs can help document retrieval, we conduct an empirical study on a large-scale multi-discipline dataset, CORD-19. Results show that instead of complex structure-oriented GNNs such as GINs and GATs, our proposed semantics-oriented graph functions achieve better and more stable performance based on the BM25 retrieved candidates. Our insights in this case study can serve as a guideline for future work to develop effective GNNs with appropriate semantics-oriented inductive biases for textual reasoning tasks like document retrieval and classification. All code for this case study is available at https://github.com/HennyJie/GNN-DocRetrieval.
    When Machine Learning Meets Spectrum Sharing Security: Methodologies and Challenges. (arXiv:2201.04677v1 [cs.CR])
    (2 min) The exponential growth of internet-connected systems has generated numerous challenges, such as spectrum shortage issues, which require efficient spectrum sharing (SS) solutions. Complicated and dynamic SS systems can be exposed to different potential security and privacy issues, requiring protection mechanisms to be adaptive, reliable, and scalable. Machine learning (ML) based methods have frequently been proposed to address those issues. In this article, we provide a comprehensive survey of the recent development of ML based SS methods, the most critical security issues, and corresponding defense mechanisms. In particular, we elaborate on the state-of-the-art methodologies for improving the performance of SS communication systems for various vital aspects, including ML based cognitive radio networks (CRNs), ML based database assisted SS networks, ML based LTE-U networks, ML based ambient backscatter networks, and other ML based SS solutions. We also present security issues from the physical layer and corresponding defending strategies based on ML algorithms, including Primary User Emulation (PUE) attacks, Spectrum Sensing Data Falsification (SSDF) attacks, jamming attacks, eavesdropping attacks, and privacy issues. Finally, extensive discussions on open challenges for ML based SS are also given. This comprehensive review is intended to provide the foundation for and facilitate future studies on exploring the potential of emerging ML for coping with increasingly complex SS and their security problems.
    Learning Without a Global Clock: Asynchronous Learning in a Physics-Driven Learning Network. (arXiv:2201.04626v1 [cond-mat.soft])
    (2 min) In a biological neural network, synapses update individually using local information, allowing for entirely decentralized learning. In contrast, elements in an artificial neural network (ANN) are typically updated simultaneously using a central processor. Here we investigate the feasibility and effect of asynchronous learning in a recently introduced decentralized, physics-driven learning network. We show that desynchronizing the learning process does not degrade performance for a variety of tasks in an idealized simulation. In experiments, desynchronization actually improves performance by allowing the system to better explore the discretized state space of solutions. We draw an analogy between asynchronicity and mini-batching in stochastic gradient descent, and show that they have similar effects on the learning process. Desynchronizing the learning process establishes physics-driven learning networks as truly fully distributed learning machines, promoting better performance and scalability in deployment.
    On Sampling Collaborative Filtering Datasets. (arXiv:2201.04768v1 [cs.LG])
    (2 min) We study the practical consequences of dataset sampling strategies on the ranking performance of recommendation algorithms. Recommender systems are generally trained and evaluated on samples of larger datasets. Samples are often taken in a naive or ad-hoc fashion: e.g. by sampling a dataset randomly or by selecting users or items with many interactions. As we demonstrate, commonly-used data sampling schemes can have significant consequences on algorithm performance. Following this observation, this paper makes three main contributions: (1) characterizing the effect of sampling on algorithm performance, in terms of algorithm and dataset characteristics (e.g. sparsity characteristics, sequential dynamics, etc.); (2) designing SVP-CF, a data-specific sampling strategy that aims to preserve the relative performance of models after sampling, and is especially suited to long-tailed interaction data; and (3) developing an oracle, Data-Genie, which can suggest the sampling scheme that is most likely to preserve model performance for a given dataset. The main benefit of Data-Genie is that it allows recommender system practitioners to quickly prototype and compare various approaches, while remaining confident that algorithm performance will be preserved once the algorithm is retrained and deployed on the complete data. Detailed experiments show that using Data-Genie, we can discard up to 5x more data than any sampling strategy with the same level of performance.
    Sparsely Changing Latent States for Prediction and Planning in Partially Observable Domains. (arXiv:2110.15949v2 [cs.LG] UPDATED)
    (2 min) A common approach to prediction and planning in partially observable domains is to use recurrent neural networks (RNNs), which ideally develop and maintain a latent memory about hidden, task-relevant factors. We hypothesize that many of these hidden factors in the physical world are constant over time, changing only sparsely. To study this hypothesis, we propose Gated $L_0$ Regularized Dynamics (GateL0RD), a novel recurrent architecture that incorporates the inductive bias to maintain stable, sparsely changing latent states. The bias is implemented by means of a novel internal gating function and a penalty on the $L_0$ norm of latent state changes. We demonstrate that GateL0RD can compete with or outperform state-of-the-art RNNs in a variety of partially observable prediction and control tasks. GateL0RD tends to encode the underlying generative factors of the environment, ignores spurious temporal dependencies, and generalizes better, improving sampling efficiency and overall performance in model-based planning and reinforcement learning tasks. Moreover, we show that the developing latent states can be easily interpreted, which is a step towards better explainability in RNNs.
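    A simplified sketch of the inductive bias (our abstraction; GateL0RD's actual gating function and L0 relaxation are more involved): latent updates pass through a gate, and the training loss penalizes how often the gate opens.

        import torch
        import torch.nn as nn

        class SparseStateCell(nn.Module):
            # The latent state h changes only where the gate opens; penalizing
            # the gate activations is a crude surrogate for an L0 penalty on
            # state change.
            def __init__(self, d_in, d_h):
                super().__init__()
                self.delta = nn.Linear(d_in + d_h, d_h)
                self.gate = nn.Linear(d_in + d_h, d_h)

            def forward(self, x, h):
                z = torch.cat([x, h], dim=-1)
                g = torch.sigmoid(self.gate(z))      # near 0 freezes the state
                h_new = h + g * torch.tanh(self.delta(z))
                penalty = g.sum()                    # add lambda * penalty to the loss
                return h_new, penalty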
    Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction and Clustering. (arXiv:2201.05077v1 [cs.SE])
    (2 min) Deep neural networks (DNNs) have demonstrated superior performance over classical machine learning to support many features in safety-critical systems. Although DNNs are now widely used in such systems (e.g., self-driving cars), there is limited progress regarding automated support for functional safety analysis in DNN-based systems. For example, the identification of root causes of errors, to enable both risk analysis and DNN retraining, remains an open problem. In this paper, we propose SAFE, a black-box approach to automatically characterize the root causes of DNN errors. SAFE relies on a transfer learning model pre-trained on ImageNet to extract the features from error-inducing images. It then applies a density-based clustering algorithm to detect arbitrary shaped clusters of images modeling plausible causes of error. Last, clusters are used to effectively retrain and improve the DNN. The black-box nature of SAFE is motivated by our objective not to require changes or even access to the DNN internals to facilitate adoption. Experimental results show the superior ability of SAFE in identifying different root causes of DNN errors based on case studies in the automotive domain. It also yields significant improvements in DNN accuracy after retraining, while saving significant execution time and memory when compared to alternatives.
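    A hedged prototype of the pipeline (the backbone choice, library calls, and DBSCAN settings below are our assumptions, not details from the paper):

        import torch
        import torchvision.models as models
        import torchvision.transforms as T
        from sklearn.cluster import DBSCAN

        # Feature extractor: an ImageNet-pretrained CNN with its head removed
        # (torchvision >= 0.13 weights syntax assumed).
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        backbone.fc = torch.nn.Identity()
        backbone.eval()

        prep = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                          T.Normalize([0.485, 0.456, 0.406],
                                      [0.229, 0.224, 0.225])])

        @torch.no_grad()
        def extract(pil_images):
            batch = torch.stack([prep(im) for im in pil_images])
            return backbone(batch).numpy()

        def cluster_error_inducing(pil_images):
            feats = extract(pil_images)
            # Density-based clustering finds arbitrarily shaped clusters of
            # plausible error causes; outliers are labelled -1.
            return DBSCAN(eps=5.0, min_samples=5).fit_predict(feats)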
    Recursive Least Squares Policy Control with Echo State Network. (arXiv:2201.04781v1 [cs.LG])
    (2 min) The echo state network (ESN) is a special type of recurrent neural network for processing time-series data. However, because of the strong correlation among an agent's sequential samples, it is difficult for ESN-based policy control algorithms to use the recursive least squares (RLS) algorithm to update the ESN's parameters. To solve this problem, we propose two novel policy control algorithms, ESNRLS-Q and ESNRLS-Sarsa. Firstly, to reduce the correlation of training samples, we use the leaky integrator ESN and the mini-batch learning mode. Secondly, to make RLS suitable for training the ESN in mini-batch mode, we present a new mean-approximation method for updating the RLS correlation matrix. Thirdly, to prevent the ESN from over-fitting, we use the L1 regularization technique. Lastly, to prevent overestimation of the target state-action value, we employ the Mellowmax method. Simulation results show that our algorithms have good convergence performance.
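    For reference, the textbook single-sample RLS update that the mean-approximation method adapts to mini-batches looks like this (our notation):

        import numpy as np

        def rls_step(w, P, x, target, lam=0.999):
            # One recursive least squares update of readout weights w;
            # P tracks the inverse of the input correlation matrix.
            Px = P @ x
            k = Px / (lam + x @ Px)            # gain vector
            err = target - w @ x
            w = w + err * k
            P = (P - np.outer(k, Px)) / lam    # Sherman-Morrison downdate
            return w, P

        d = 8
        w, P = np.zeros(d), np.eye(d) * 100.0
        x, y = np.random.randn(d), 1.0
        w, P = rls_step(w, P, x, y)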
    Multi-echelon Supply Chains with Uncertain Seasonal Demands and Lead Times Using Deep Reinforcement Learning. (arXiv:2201.04651v1 [cs.LG])
    (2 min) We address the problem of production planning and distribution in multi-echelon supply chains. We consider uncertain demands and lead times, which makes the problem stochastic and non-linear. A Markov Decision Process formulation and a Non-linear Programming model are presented. As a sequential decision-making problem, Deep Reinforcement Learning (RL) is a possible solution approach. This type of technique has gained a lot of attention from the Artificial Intelligence and Optimization communities in recent years. Considering the good results obtained with Deep RL approaches in different areas, there is a growing interest in applying them to problems from the Operations Research field. We have used a Deep RL technique, namely Proximal Policy Optimization (PPO2), to solve the problem considering uncertain, regular and seasonal demands and constant or stochastic lead times. Experiments are carried out in different scenarios to better assess the suitability of the algorithm. An agent based on a linearized model is used as a baseline. Experimental results indicate that PPO2 is a competitive and adequate tool for this type of problem. The PPO2 agent is better than the baseline in all scenarios with stochastic lead times (7.3-11.2%), regardless of whether demands are seasonal or not. In scenarios with constant lead times, the PPO2 agent is better when uncertain demands are non-seasonal (2.2-4.7%). The results show that the greater the uncertainty of the scenario, the greater the viability of this type of approach.
    Certifiable Robustness for Nearest Neighbor Classifiers. (arXiv:2201.04770v1 [cs.LG])
    (2 min) ML models are typically trained using large datasets of high quality. However, training datasets often contain inconsistent or incomplete data. To tackle this issue, one solution is to develop algorithms that can check whether a prediction of a model is certifiably robust. Given a learning algorithm that produces a classifier and given an example at test time, a classification outcome is certifiably robust if it is predicted by every model trained across all possible worlds (repairs) of the uncertain (inconsistent) dataset. This notion of robustness falls naturally under the framework of certain answers. In this paper, we study the complexity of certifying robustness for a simple but widely deployed classification algorithm, $k$-Nearest Neighbors ($k$-NN). Our main focus is on inconsistent datasets when the integrity constraints are functional dependencies (FDs). For this setting, we establish a dichotomy in the complexity of certifying robustness w.r.t. the set of FDs: the problem either admits a polynomial time algorithm, or it is coNP-hard. Additionally, we exhibit a similar dichotomy for the counting version of the problem, where the goal is to count the number of possible worlds that predict a certain label. As a byproduct of our study, we also establish the complexity of a problem related to finding an optimal subset repair that may be of independent interest.
    Direct Mutation and Crossover in Genetic Algorithms Applied to Reinforcement Learning Tasks. (arXiv:2201.04815v1 [cs.NE])
    (2 min) Neuroevolution has recently been shown to be quite competitive in reinforcement learning (RL) settings, and is able to alleviate some of the drawbacks of gradient-based approaches. This paper will focus on applying neuroevolution using a simple genetic algorithm (GA) to find the weights of a neural network that produce optimally behaving agents. In addition, we present two novel modifications that improve the data efficiency and speed of convergence when compared to the initial implementation. The modifications are evaluated on the FrozenLake environment provided by OpenAI gym and prove to be significantly better than the baseline approach.
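    A bare-bones version of the baseline GA (our sketch; the paper's two modifications are not reproduced here, and the classic gym API, where step returns a 4-tuple, is assumed; newer gym/gymnasium versions differ):

        import numpy as np
        import gym

        env = gym.make("FrozenLake-v1")
        N_OBS, N_ACT, POP, GENS = 16, 4, 50, 100

        def act(weights, obs):
            W = weights.reshape(N_OBS, N_ACT)       # one-layer policy
            return int(np.argmax(W[obs]))

        def fitness(weights, episodes=20):
            total = 0.0
            for _ in range(episodes):
                obs, done = env.reset(), False
                while not done:
                    obs, r, done, _ = env.step(act(weights, obs))
                    total += r
            return total / episodes

        pop = [np.random.randn(N_OBS * N_ACT) for _ in range(POP)]
        for g in range(GENS):
            scores = [fitness(w) for w in pop]
            elite = [pop[i] for i in np.argsort(scores)[-10:]]   # best 10
            children = []
            while len(children) < POP - len(elite):
                a, b = (elite[i] for i in np.random.randint(10, size=2))
                cut = np.random.randint(len(a))
                child = np.concatenate([a[:cut], b[cut:]])       # crossover
                child += 0.1 * np.random.randn(len(child))       # mutation
                children.append(child)
            pop = elite + children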
    On the Design of Graph Embeddings for the Sensorless Estimation of Road Traffic Profiles. (arXiv:2201.04968v1 [cs.LG])
    (0 min) Traffic forecasting models rely on data that needs to be sensed, processed, and stored. This requires the deployment and maintenance of traffic sensing infrastructure, often leading to unaffordable monetary costs. The lack of sensed locations can be complemented with synthetic data simulations that further lower the economical investment needed for traffic monitoring. One of the most common data generative approaches consists of producing real-like traffic patterns, according to data distributions from analogous roads. The process of detecting roads with similar traffic is the key point of these systems. However, without collecting data at the target location, no flow metrics can be employed for this similarity-based search. We present a method to discover, among roads with available traffic data, those most similar to a target location by inspecting topological features of road segments. Relevant topological features are extracted as numerical representations (embeddings) to compare different locations and eventually find the most similar roads based on the similarity between their embeddings. The performance of this novel selection system is examined and compared to simpler traffic estimation approaches. After finding a similar source of data, a generative method is used to synthesize traffic profiles. Depending on the resemblance of the traffic behavior at the sensed road, the generation method can be fed with data from one road only. Several generation approaches are analyzed in terms of the precision of the synthesized samples. Above all, this work intends to stimulate further research efforts towards enhancing the quality of synthetic traffic samples and thereby, reducing the need for sensing infrastructure.
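    The similarity-based search reduces to nearest-neighbour lookup in embedding space; a minimal sketch (ours, with generic cosine similarity; the paper studies its own embedding designs):

        import numpy as np

        def most_similar_roads(embeddings, target_id, k=5):
            # embeddings: dict road_id -> vector for sensed roads plus the
            # target; returns the k sensed roads closest to the target.
            ids = [i for i in embeddings if i != target_id]
            t = embeddings[target_id]
            def cos(a, b):
                return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
            return sorted(ids, key=lambda i: cos(embeddings[i], t), reverse=True)[:k]

        emb = {"road_a": np.array([1.0, 0.0]),
               "road_b": np.array([0.9, 0.1]),
               "target": np.array([1.0, 0.1])}
        print(most_similar_roads(emb, "target", k=1))   # ['road_b']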
    Criticality-Based Varying Step-Number Algorithm for Reinforcement Learning. (arXiv:2201.05034v1 [cs.LG])
    (0 min) In the context of reinforcement learning we introduce the concept of criticality of a state, which indicates the extent to which the choice of action in that particular state influences the expected return. That is, a state in which the choice of action is more likely to influence the final outcome is considered as more critical than a state in which it is less likely to influence the final outcome. We formulate a criticality-based varying step number algorithm (CVS) - a flexible step number algorithm that utilizes the criticality function provided by a human, or learned directly from the environment. We test it in three different domains including the Atari Pong environment, Road-Tree environment, and Shooter environment. We demonstrate that CVS is able to outperform popular learning algorithms such as Deep Q-Learning and Monte Carlo.
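    A hedged sketch of the mechanism (how criticality maps to the step number is our illustrative assumption, not the paper's specification):

        import numpy as np

        def n_step_target(rewards, q_bootstrap, gamma, n):
            # Standard n-step TD target: discounted rewards, then bootstrap.
            n = min(n, len(rewards))
            ret = sum(gamma ** i * rewards[i] for i in range(n))
            return ret + gamma ** n * q_bootstrap

        def cvs_step_number(criticality, n_min=1, n_max=10):
            # Illustrative monotone mapping: highly critical states get short
            # horizons (frequent bootstrapping), low-criticality stretches are
            # bridged with long multi-step returns. The direction of the
            # mapping is an assumption for this sketch.
            return int(round(n_max - criticality * (n_max - n_min)))

        # Example: criticality 0.9 -> 2-step target, criticality 0.1 -> 9-step.
        print(cvs_step_number(0.9), cvs_step_number(0.1))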
    Multi-task longitudinal forecasting with missing values on Alzheimer's Disease. (arXiv:2201.05040v1 [stat.ML])
    (0 min) Machine learning techniques typically applied to dementia forecasting lack the capability to jointly learn several tasks and to handle time-dependent heterogeneous data and missing values. In this paper, we propose a framework using the recently presented SSHIBA model for jointly learning different tasks on longitudinal data with missing values. The method uses Bayesian variational inference to impute missing values and combine information of several views. This way, we can combine different data-views from different time-points in a common latent space and learn the relations between each time-point while simultaneously modelling and predicting several output variables. We apply this model to predict together diagnosis, ventricle volume, and clinical scores in dementia. The results demonstrate that SSHIBA is capable of learning a good imputation of the missing values and outperforming the baselines while simultaneously predicting three different tasks.
    VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements. (arXiv:2112.10893v2 [cs.SE] UPDATED)
    (0 min) Automatically locating vulnerable statements in source code is crucial to assure software security and alleviate developers' debugging efforts. This becomes even more important in today's software ecosystem, where vulnerable code can flow easily and unwittingly within and across software repositories like GitHub. Across such millions of lines of code, traditional static and dynamic approaches struggle to scale. Although existing machine-learning-based approaches look promising in such a setting, most work detects vulnerable code at a higher granularity -- at the method or file level. Thus, developers still need to inspect a significant amount of code to locate the vulnerable statement(s) that need to be fixed. This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements. Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph and effectively understand code semantics and vulnerable patterns. To study VELVET's effectiveness, we use an off-the-shelf synthetic dataset and a recently published real-world dataset. In the static analysis setting, where vulnerable functions are not detected in advance, VELVET achieves 4.5x better performance than the baseline static analyzers on the real-world data. For the isolated vulnerability localization task, where we assume the vulnerability of a function is known while the specific vulnerable statement is unknown, we compare VELVET with several neural networks that also attend to local and global context of code. VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively, outperforming the baseline deep-learning models by 5.3-29.0%.
    Combining Interventional and Observational Data Using Causal Reductions. (arXiv:2103.04786v2 [stat.ML] UPDATED)
    (2 min) Unobserved confounding is one of the main challenges when estimating causal effects. We propose a causal reduction method that, given a causal model, replaces an arbitrary number of possibly high-dimensional latent confounders with a single latent confounder that takes values in the same space as the treatment variable, without changing the observational and interventional distributions the causal model entails. This allows us to estimate the causal effect in a principled way from combined data without relying on the common but often unrealistic assumption that all confounders have been observed. We apply our causal reduction in three different settings. In the first setting, we assume the treatment and outcome to be discrete. The causal reduction then implies bounds between the observational and interventional distributions that can be exploited for estimation purposes. In certain cases with highly unbalanced observational samples, the accuracy of the causal effect estimate can be improved by incorporating observational data. Second, for continuous variables and assuming a linear-Gaussian model, we derive equality constraints for the parameters of the observational and interventional distributions. Third, for the general continuous setting (possibly nonlinear or non-Gaussian), we parameterize the reduced causal model using normalizing flows, a flexible class of easily invertible nonlinear transformations. We perform a series of experiments on synthetic data and find that in several cases the number of interventional samples can be reduced when adding observational training samples without sacrificing accuracy.
    Evaluation of Neural Networks Defenses and Attacks using NDCG and Reciprocal Rank Metrics. (arXiv:2201.05071v1 [cs.CR])
    (2 min) The problem of attacks on neural networks through input modification (i.e., adversarial examples) has attracted much attention recently. Being relatively easy to generate and hard to detect, these attacks pose a security breach that many suggested defenses try to mitigate. However, the evaluation of the effect of attacks and defenses commonly relies on traditional classification metrics, without adequate adaptation to adversarial scenarios. Most of these metrics are accuracy-based, and therefore may have a limited scope and low distinctive power. Other metrics do not consider the unique characteristics of neural networks functionality, or measure the effect of the attacks indirectly (e.g., through the complexity of their generation). In this paper, we present two metrics which are specifically designed to measure the effect of attacks, or the recovery effect of defenses, on the output of neural networks in multiclass classification tasks. Inspired by the normalized discounted cumulative gain and the reciprocal rank metrics used in information retrieval literature, we treat the neural network predictions as ranked lists of results. Using additional information about the probability of the rank enables us to define novel metrics that are suited to the task at hand. We evaluate our metrics using various attacks and defenses on a pretrained VGG19 model and the ImageNet dataset. Compared to the common classification metrics, our proposed metrics demonstrate superior informativeness and distinctiveness.
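    Both underlying retrieval metrics have standard closed forms; a small reference implementation (ours) over a single prediction vector:

        import numpy as np

        def reciprocal_rank(scores, true_label):
            # 1/rank of the correct class in the model's score ordering.
            order = np.argsort(scores)[::-1]
            return 1.0 / (int(np.where(order == true_label)[0][0]) + 1)

        def ndcg(scores, relevance):
            # Normalized discounted cumulative gain over the ranked classes.
            order = np.argsort(scores)[::-1]
            gains = relevance[order]
            discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
            dcg = float((gains * discounts).sum())
            ideal = float((np.sort(relevance)[::-1] * discounts).sum())
            return dcg / ideal if ideal > 0 else 0.0

        scores = np.array([0.1, 0.7, 0.2])             # model output, 3 classes
        print(reciprocal_rank(scores, true_label=2))   # ranked 2nd -> 0.5
        print(ndcg(scores, np.array([0.0, 0.0, 1.0])))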
    Pointwise Binary Classification with Pairwise Confidence Comparisons. (arXiv:2010.01875v4 [cs.LG] UPDATED)
    (2 min) To alleviate the data requirement for training effective binary classifiers, many weakly supervised learning settings have been proposed. Among them, some consider using pairwise but not pointwise labels, when pointwise labels are not accessible due to privacy, confidentiality, or security reasons. However, as a pairwise label denotes whether or not two data points share a pointwise label, it cannot be easily collected if either point is equally likely to be positive or negative. Thus, in this paper, we propose a novel setting called pairwise comparison (Pcomp) classification, where we have only pairs of unlabeled data of which we know one is more likely to be positive than the other. Firstly, we give a Pcomp data generation process, derive an unbiased risk estimator (URE) with a theoretical guarantee, and further improve URE using correction functions. Secondly, we link Pcomp classification to noisy-label learning to develop a progressive URE and improve it by imposing consistency regularization. Finally, we demonstrate by experiments the effectiveness of our methods, which suggests that Pcomp is a valuable and practically useful type of pairwise supervision besides the pairwise label.
    An improved LogNNet classifier for IoT application. (arXiv:2105.14412v2 [cs.LG] UPDATED)
    (2 min) In the age of neural networks and the Internet of Things (IoT), the search for new neural network architectures capable of operating on devices with limited computing power and small memory size has become an urgent task. Designing suitable algorithms for IoT applications is an important task. This paper proposes a feed-forward LogNNet neural network, which uses a semi-linear Henon-type discrete chaotic map to classify the MNIST-10 dataset. The model is composed of a reservoir part and a trainable classifier. The aim of the reservoir part is to transform the inputs to maximize classification accuracy, using a special matrix filling method and a time series generated by the chaotic map. The parameters of the chaotic map are optimized using particle swarm optimization with random immigrants. As a result, the proposed LogNNet/Henon classifier has higher accuracy and the same RAM usage compared to the original version of LogNNet, and offers promising opportunities for implementation in IoT devices. In addition, a direct relation between the value of entropy and the accuracy of the classification is demonstrated.
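    The classic Henon map behind the reservoir's time series takes a few lines (the paper uses a semi-linear variant; the standard map with its usual parameters is shown for illustration, and the reservoir shape is hypothetical):

        import numpy as np

        def henon_series(n, a=1.4, b=0.3, x0=0.0, y0=0.0):
            # Henon map: x' = 1 - a*x^2 + y, y' = b*x.
            xs = np.empty(n)
            x, y = x0, y0
            for i in range(n):
                x, y = 1.0 - a * x * x + y, b * x
                xs[i] = x
            return xs

        # A chaotic series like this can fill the (untrained) reservoir
        # weight matrix of a LogNNet-style classifier.
        series = henon_series(25 * 784)
        W_reservoir = series.reshape(25, 784)   # e.g. 25 rows for MNIST inputs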
    Global Optimality Beyond Two Layers: Training Deep ReLU Networks via Convex Programs. (arXiv:2110.05518v2 [cs.LG] UPDATED)
    (2 min) Understanding the fundamental mechanism behind the success of deep neural networks is one of the key challenges in the modern machine learning literature. Despite numerous attempts, a solid theoretical analysis is yet to be developed. In this paper, we develop a novel unified framework to reveal a hidden regularization mechanism through the lens of convex optimization. We first show that the training of multiple three-layer ReLU sub-networks with weight decay regularization can be equivalently cast as a convex optimization problem in a higher dimensional space, where sparsity is enforced via a group $\ell_1$-norm regularization. Consequently, ReLU networks can be interpreted as high dimensional feature selection methods. More importantly, we then prove that the equivalent convex problem can be globally optimized by a standard convex optimization solver with a polynomial-time complexity with respect to the number of samples and data dimension when the width of the network is fixed. Finally, we numerically validate our theoretical results via experiments involving both synthetic and real datasets.
    Unlabeled Data Improves Adversarial Robustness. (arXiv:1905.13736v4 [stat.ML] UPDATED)
    (2 min) We demonstrate, theoretically and empirically, that adversarial robustness can significantly benefit from semisupervised learning. Theoretically, we revisit the simple Gaussian model of Schmidt et al. that shows a sample complexity gap between standard and robust classification. We prove that unlabeled data bridges this gap: a simple semisupervised learning procedure (self-training) achieves high robust accuracy using the same number of labels required for achieving high standard accuracy. Empirically, we augment CIFAR-10 with 500K unlabeled images sourced from 80 Million Tiny Images and use robust self-training to outperform state-of-the-art robust accuracies by over 5 points in (i) $\ell_\infty$ robustness against several strong attacks via adversarial training and (ii) certified $\ell_2$ and $\ell_\infty$ robustness via randomized smoothing. On SVHN, adding the dataset's own extra training set with the labels removed provides gains of 4 to 10 points, within 1 point of the gain from using the extra labels.
    Ensemble Augmentation for Deep Neural Networks Using 1-D Time Series Vibration Data. (arXiv:2108.03288v2 [cs.LG] UPDATED)
    (2 min) Time-series data are one of the fundamental types of raw data representation used in data-driven techniques. In machine condition monitoring, time-series vibration data are widely used in data mining for deep neural networks. Typically, vibration data are converted into images for classification using Deep Neural Networks (DNNs), and scalograms are the most effective form of image representation. However, DNN classifiers require large numbers of labeled training samples to reach their optimum performance. So, many forms of data augmentation techniques are applied to the classifiers to compensate for the lack of training samples. However, scalograms are graphical representations on which the existing augmentation techniques suffer, because they either change the graphical meaning or introduce so much noise that the physical meaning changes. In this study, a data augmentation technique named ensemble augmentation is proposed to overcome this limitation. This augmentation method uses the power of white noise added in ensembles to the original samples to generate real-like samples. After averaging the signal with ensembles, a new signal is obtained that contains the characteristics of the original signal. The parameters for the ensemble augmentation are validated using a simulated signal. The proposed method is evaluated using 10-class bearing vibration data using three state-of-the-art Transfer Learning (TL) models, namely, Inception-V3, MobileNet-V2, and ResNet50. Augmented samples are generated in two increments: the first increment generates the same number of fake samples as the training samples, and in the second increment, the number of samples is increased gradually. The outputs from the proposed method are compared with no augmentation, augmentations using a deep convolution generative adversarial network (DCGAN), and several geometric transformation-based augmentations...
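    The described augmentation reduces to a few lines (our reading of the abstract; sigma and the ensemble count are hypothetical parameters):

        import numpy as np

        def ensemble_augment(signal, n_ensembles=8, sigma=0.05, rng=None):
            # Average several white-noise-corrupted copies of a vibration
            # signal; the average keeps the signal's character while differing
            # from the original by a residual noise floor.
            rng = rng or np.random.default_rng()
            noisy = signal + sigma * rng.standard_normal((n_ensembles, len(signal)))
            return noisy.mean(axis=0)

        x = np.sin(np.linspace(0, 20 * np.pi, 2048))   # stand-in vibration signal
        fake = ensemble_augment(x)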
    Largest Eigenvalues of the Conjugate Kernel of Single-Layered Neural Networks. (arXiv:2201.04753v1 [math.PR])
    (2 min) This paper is concerned with the asymptotic distribution of the largest eigenvalues for some nonlinear random matrix ensemble stemming from the study of neural networks. More precisely, we consider $M= \frac{1}{m} YY^\top$ with $Y=f(WX)$ where $W$ and $X$ are random rectangular matrices with i.i.d. centered entries. This models the data covariance matrix or the Conjugate Kernel of a single-layered random Feed-Forward Neural Network. The function $f$ is applied entrywise and can be seen as the activation function of the neural network. We show that the largest eigenvalue has the same limit (in probability) as that of some well-known linear random matrix ensembles. In particular, we relate the asymptotic limit of the largest eigenvalue for the nonlinear model to that of an information-plus-noise random matrix, establishing a possible phase transition depending on the function $f$ and the distribution of $W$ and $X$. This may be of interest for applications to machine learning.
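    The ensemble is easy to simulate numerically (a quick sketch matching the abstract's definitions; the 1/sqrt(d) scaling of $W$ and the choice of tanh are our assumptions):

        import numpy as np

        n, m, d = 500, 1500, 500           # layer width, samples, input dim
        rng = np.random.default_rng(0)
        W = rng.normal(size=(n, d)) / np.sqrt(d)
        X = rng.normal(size=(d, m))
        f = np.tanh                        # entrywise activation

        Y = f(W @ X)
        M = (Y @ Y.T) / m                  # conjugate kernel / data covariance
        top_eig = np.linalg.eigvalsh(M)[-1]
        print(top_eig)                     # edge of the empirical spectrum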
    Privacy-Utility Trades in Crowdsourced Signal Map Obfuscation. (arXiv:2201.04782v1 [cs.CR])
    (2 min) Cellular providers and data aggregating companies crowdsource cellular signal strength measurements from user devices to generate signal maps, which can be used to improve network performance. Recognizing that this data collection may be at odds with growing awareness of privacy concerns, we consider obfuscating such data before the data leaves the mobile device. The goal is to increase privacy such that it is difficult to recover sensitive features from the obfuscated data (e.g. user ids and user whereabouts), while still allowing network providers to use the data for improving network services (i.e. create accurate signal maps). To examine this privacy-utility tradeoff, we identify privacy and utility metrics and threat models suited to signal strength measurements. We then obfuscate the measurements using several preeminent techniques, spanning differential privacy, generative adversarial privacy, and information-theoretic privacy techniques, in order to benchmark a variety of promising obfuscation approaches and provide guidance to real-world engineers who are tasked to build signal maps that protect privacy without hurting utility. Our evaluation results, based on multiple, diverse, real-world signal map datasets, demonstrate the feasibility of concurrently achieving adequate privacy and utility, with obfuscation strategies which use the structure and intended use of datasets in their design, and target average-case, rather than worst-case, guarantees.
    1-Dimensional polynomial neural networks for audio signal related problems. (arXiv:2009.04077v2 [eess.AS] UPDATED)
    (2 min) In addition to being extremely non-linear, modern problems require millions if not billions of parameters to solve, or at least to approximate well. Neural networks are known to absorb that complexity by deepening and widening their topology in order to increase the level of non-linearity needed for a better approximation. However, compact topologies are always preferred to deeper ones as they offer the advantage of using fewer computational units and fewer parameters. This compactness comes at the price of reduced non-linearity and, thus, of a more limited solution search space. We propose the 1-Dimensional Polynomial Neural Network (1DPNN) model that uses automatic polynomial kernel estimation for 1-Dimensional Convolutional Neural Networks (1DCNNs) and that introduces a high degree of non-linearity from the first layer, which can compensate for the need for deep and/or wide topologies. We show that this non-linearity enables the model to yield better results with less computational and spatial complexity than a regular 1DCNN on various classification and regression problems related to audio signals, even though it introduces more computational and spatial complexity on a neuronal level. The experiments were conducted on three publicly available datasets and demonstrate that, on the problems that were tackled, the proposed model can extract more relevant information from the data than a 1DCNN in less time and with less memory.
    Interpretable Automated Diagnosis of Retinal Disease Using Deep OCT Analysis. (arXiv:2109.02436v2 [eess.IV] UPDATED)
    (2 min) Thirty million Optical Coherence Tomography (OCT) imaging tests are performed every year to diagnose various retinal diseases, but accurate diagnosis of OCT scans requires trained ophthalmologists, who are still prone to making errors. With better systems for diagnosis, many cases of vision loss caused by retinal disease could be entirely avoided. In this work, we develop a novel deep learning architecture for explainable, accurate classification of retinal disease which achieves state-of-the-art accuracy. Furthermore, we place an emphasis on producing both qualitative and quantitative explanations of the model's decisions. Our algorithm produces heatmaps indicating the exact regions in the OCT scan the model focused on when making its decision. In combination with an OCT segmentation model, this allows us to produce quantitative breakdowns of the specific retinal layers the model focused on, for later review by an expert. Our work is the first to produce detailed quantitative explanations of the model's decisions in this way. Our combination of accuracy and interpretability can be applied clinically for better patient care.
    Towards a trustworthy, secure and reliable enclave for machine learning in a hospital setting: The Essen Medical Computing Platform (EMCP). (arXiv:2201.04816v1 [cs.CR])
    (2 min) AI/computing at scale is a difficult problem, especially in a health care setting. We outline the requirements, planning, and implementation choices as well as the guiding principles that led to the implementation of our secure research computing enclave, the Essen Medical Computing Platform (EMCP), affiliated with a major German hospital. Compliance, data privacy, and usability were the immutable requirements of the system. We discuss the features of our computing enclave and provide a recipe for groups wishing to adopt a similar setup.
    Emojich -- zero-shot emoji generation using Russian language: a technical report. (arXiv:2112.02448v2 [cs.CL] UPDATED)
    (2 min) This technical report presents "Emojich", a text-to-image neural network that generates emojis conditioned on captions in Russian. We aim to keep the generalization ability of the pretrained big model ruDALL-E Malevich (XL, 1.3B parameters) at the fine-tuning stage, while giving a special style to the generated images. We present the engineering methods, the code realization, all hyper-parameters for reproducing the results, and a Telegram bot where everyone can create their own customized sets of stickers. We also demonstrate some newly generated emojis obtained with the "Emojich" model.
    Improving VAE based molecular representations for compound property prediction. (arXiv:2201.04929v1 [cs.LG])
    (2 min) Collecting labeled data for many important tasks in chemoinformatics is time-consuming and requires expensive experiments. In recent years, machine learning has been used to learn rich representations of molecules using large-scale unlabeled molecular datasets and transfer the knowledge to solve more challenging tasks with limited datasets. Variational autoencoders are one of the tools that have been proposed to perform this transfer for both chemical property prediction and molecular generation tasks. In this work we propose a simple method to improve the chemical property prediction performance of machine learning models by incorporating additional information on correlated molecular descriptors into the representations learned by variational autoencoders. We verify the method on three property prediction tasks. We explore the impact of the number of incorporated descriptors, the correlation between the descriptors and the target properties, the sizes of the datasets, etc. Finally, we show the relation between the performance of property prediction models and the distance between the property prediction dataset and the larger unlabeled dataset in the representation space.
    Online 3D Bin Packing with Constrained Deep Reinforcement Learning. (arXiv:2006.14978v5 [cs.LG] UPDATED)
    (2 min) We solve a challenging yet practically useful variant of the 3D Bin Packing Problem (3D-BPP). In our problem, the agent has limited information about the items to be packed into the bin, and an item must be packed immediately after its arrival without buffering or readjusting. The item's placement is also subject to the constraints of collision avoidance and physical stability. We formulate this online 3D-BPP as a constrained Markov decision process. To solve the problem, we propose an effective and easy-to-implement constrained deep reinforcement learning (DRL) method under the actor-critic framework. In particular, we introduce a feasibility predictor to predict the feasibility mask for the placement actions and use it to modulate the action probabilities output by the actor during training. Such supervision and transformation enable the agent to learn feasible policies efficiently. Our method can also be generalized, e.g., to handle lookahead or items with different orientations. We have conducted extensive evaluation showing that the learned policy significantly outperforms the state-of-the-art methods. A user study suggests that our method attains human-level performance.
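    The masking step can be sketched in a few lines: infeasible placements get probability zero by sending their logits to negative infinity before the softmax (here the mask is given; in the paper it comes from the learned feasibility predictor):

      import numpy as np

      def masked_policy(logits, feasibility_mask):
          # Suppress infeasible actions, then renormalize with a softmax.
          masked = np.where(feasibility_mask, logits, -np.inf)
          z = masked - masked.max()        # stabilize the exponentials
          probs = np.exp(z)
          return probs / probs.sum()

      logits = np.array([1.2, 0.3, -0.5, 2.0])
      mask = np.array([True, False, True, True])  # action 1 would collide
      print(masked_policy(logits, mask))          # infeasible action gets 0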
    Generative time series models using Neural ODE in Variational Autoencoders. (arXiv:2201.04630v1 [cs.LG])
    (2 min) In this paper, we implement Neural Ordinary Differential Equations in a Variational Autoencoder setting for generative time series modeling. An object-oriented approach to the code was taken to allow for easier development and research, and all code used in the paper can be found here: https://github.com/simonmoesorensen/neural-ode-project The results were initially recreated and the reconstructions compared to a baseline Long Short-Term Memory autoencoder. The model was then extended with an LSTM encoder and challenged by more complex data consisting of time series in the form of spring oscillations. The model showed promise and was able to reconstruct true trajectories for all complexities of data with a smaller RMSE than the baseline model. However, while it was able to capture the dynamic behavior of the time series for known data in the decoder, it was not able to produce extrapolations following the true trajectory very well for any of the complexities of spring data. A final experiment was carried out where the model was also presented with 68 days of solar power production data, and it was able to reconstruct just as well as the baseline, even when very little data was available. Finally, the model's training time was compared to the baseline's. It was found that for small amounts of data the NODE method was significantly slower at training than the baseline, while for larger amounts of data the NODE method was equal or faster at training. The paper ends with a future work section which describes the many natural extensions of the work presented here, examples being further investigating the importance of input data, including extrapolation in the baseline model, or testing more specific model setups.
    Can we imitate the principal investor's behavior to learn option price?. (arXiv:2105.11376v2 [q-fin.PR] UPDATED)
    (2 min) This paper presents a framework for imitating the principal investor's behavior for optimal pricing and hedging of options. We construct a non-deterministic Markov decision process for modeling stock price change driven by the principal investor's decision making. However, the low signal-to-noise ratio and instability that are inherent in equity markets pose challenges for determining the state transition (stock price change) after executing an action (the principal investor's decision), as well as for deciding an action based on the current state (spot price). To overcome these challenges, we resort to a Bayesian deep neural network for computing the predictive distribution of the state transition led by an action. Additionally, instead of exploring a state-action relationship to formulate a policy, we seek an episode-based visible-hidden state-action relationship to probabilistically imitate the principal investor's successive decision making. Unlike conventional option pricing that employs analytical stochastic processes or utilizes time series analysis to model and sample underlying stock price movements, our algorithm simulates stock price paths by imitating the principal investor's behavior, which requires no preset probability distribution and fewer predetermined parameters. Eventually the optimal option price is learned by reinforcement learning to maximize the cumulative risk-adjusted return of a dynamically hedged portfolio over the simulated price paths.
    Detection of brain tumors using machine learning algorithms. (arXiv:2201.04703v1 [eess.IV])
    (2 min) An algorithm for processing NMR images was developed, using machine learning techniques to detect the presence of brain tumors.
    A Non-Classical Parameterization for Density Estimation Using Sample Moments. (arXiv:2201.04786v1 [stat.ML])
    (2 min) Moment methods are an important means of density estimation, but they generally depend strongly on the choice of feasible functions, which severely affects performance. We propose a non-classical parameterization for density estimation using sample moments, which does not require the choice of such functions. The parameterization is induced by the Kullback-Leibler distance, and its solution, which is proved to exist and to be unique subject to a simple prior that does not depend on the data, can be obtained by convex optimization. Simulation results show the performance of the proposed estimator in estimating multi-modal densities which are mixtures of different types of functions.
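    For contrast, a sketch of the classical moment approach the paper moves beyond: a maximum-entropy density exp(sum_j lam_j x^j) fitted to the first few sample moments by convex optimization (grid support and moment order are illustrative assumptions):

      import numpy as np
      from scipy.optimize import minimize

      rng = np.random.default_rng(0)
      data = np.concatenate([rng.normal(-2, 0.5, 500),
                             rng.normal(2, 0.5, 500)])   # bimodal sample
      k = 4
      mhat = np.array([np.mean(data**j) for j in range(1, k + 1)])

      x = np.linspace(-6, 6, 2001)
      dx = x[1] - x[0]
      feats = np.stack([x**j for j in range(1, k + 1)])  # shape (k, grid)

      def dual(lam):
          # Convex dual: log-partition function minus <lam, sample moments>.
          logq = lam @ feats
          m = logq.max()
          return m + np.log(np.sum(np.exp(logq - m)) * dx) - lam @ mhat

      lam = minimize(dual, np.zeros(k), method="BFGS").x
      p = np.exp(lam @ feats - (lam @ feats).max())
      p /= p.sum() * dx                                  # normalized density
      print([round(float(np.sum(p * x**j) * dx), 3) for j in range(1, k + 1)])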
    Applying Machine Learning and AI Explanations to Analyze Vaccine Hesitancy. (arXiv:2201.05070v1 [cs.LG])
    (2 min) The paper quantifies the impact of race, poverty, politics, and age on COVID-19 vaccination rates in counties in the continental US. Both OLS regression analysis and a Random Forest machine learning algorithm are applied to quantify factors behind county-level vaccination hesitancy. The machine learning model considers the joint effects of variables (race/ethnicity, partisanship, age, etc.) simultaneously to capture the unique combination of these factors on the vaccination rate. By implementing a state-of-the-art Artificial Intelligence Explanations (AIX) algorithm, it is possible to solve the black box problem with machine learning models and provide answers to the "how much" question for each measured impact factor in every county. For most counties, a higher percentage vote for Republicans, a greater African American population share, and a higher poverty rate lower the vaccination rate, while a higher Asian population share increases it. The impact on the vaccination rate from the Hispanic population proportion is positive in the OLS model, but only positive for counties with a high Hispanic population (>65%) in the Random Forest model. The proportions of seniors and of young people in a county both have significant impacts in the OLS model, positive and negative respectively, whereas the impacts are ambiguous in the Random Forest model. Because results vary between geographies and the AIX algorithm quantifies vaccine impact factors individually for each county, this research can be tailored to local communities. An interactive online mapping dashboard that identifies impact factors for individual U.S. counties is available at https://www.cpp.edu/~clange/vacmap.html. It is apparent that the influence of impact factors is not universally the same across different geographies.
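    The paper's AIX algorithm is not reproduced here, but a simpler stand-in conveys the idea of quantifying per-factor impacts in a county-level random-forest model; the feature names and synthetic data are illustrative assumptions:

      import numpy as np
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.inspection import permutation_importance

      rng = np.random.default_rng(0)
      n = 1000
      X = rng.uniform(0, 1, size=(n, 4))  # gop_share, black_share, poverty, age65
      y = (0.8 - 0.3 * X[:, 0] - 0.1 * X[:, 1] - 0.15 * X[:, 2]
           + 0.1 * X[:, 3] + rng.normal(0, 0.02, n))   # synthetic vax rate

      model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
      imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
      for name, m in zip(["gop_share", "black_share", "poverty", "age65"],
                         imp.importances_mean):
          print(f"{name}: {m:.4f}")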
    Towards Automated Error Analysis: Learning to Characterize Errors. (arXiv:2201.05017v1 [cs.CL])
    (2 min) Characterizing the patterns of errors that a system makes helps researchers focus future development on increasing its accuracy and robustness. We propose a novel form of "meta learning" that automatically learns interpretable rules characterizing the types of errors that a system makes, and demonstrate these rules' ability to help understand and improve two NLP systems. Our approach works by collecting error cases on validation data, extracting meta-features describing these samples, and finally learning rules that characterize errors using these features. We apply our approach to ViLBERT, for Visual Question Answering, and RoBERTa, for Common Sense Question Answering. Our system learns interpretable rules that provide insights into the systemic errors these systems make on the given tasks. Using these insights, we are also able to "close the loop" and modestly improve the performance of these systems.
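    A minimal sketch of the rule-learning step, under simplifying assumptions: fit a shallow decision tree on meta-features of validation samples, with a 0/1 target marking whether the system erred, then read off the rules (the meta-feature names are hypothetical):

      import numpy as np
      from sklearn.tree import DecisionTreeClassifier, export_text

      rng = np.random.default_rng(0)
      n = 500
      meta = np.column_stack([
          rng.integers(3, 30, n),    # question_length
          rng.integers(0, 2, n),     # contains_negation
          rng.uniform(0, 1, n),      # answer_entropy
      ])
      # Synthetic truth: errors concentrate on long, negated questions.
      is_error = ((meta[:, 0] > 20) & (meta[:, 1] == 1)).astype(int)

      tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(meta, is_error)
      print(export_text(tree, feature_names=[
          "question_length", "contains_negation", "answer_entropy"]))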
    Active Learning-Based Multistage Sequential Decision-Making Model with Application on Common Bile Duct Stone Evaluation. (arXiv:2201.04807v1 [stat.ML])
    (2 min) Multistage sequential decision-making scenarios are commonly seen in the healthcare diagnosis process. In this paper, an active learning-based method is developed to actively collect only the necessary patient data in a sequential manner. There are two novelties in the proposed method. First, unlike the existing ordinal logistic regression model which only models a single stage, we estimate the parameters for all stages together. Second, it is assumed that the coefficients for common features in different stages are kept consistent. The effectiveness of the proposed method is validated in both a simulation study and a real case study. Compared with the baseline method where the data is modeled individually and independently, the proposed method improves the estimation efficiency by 62%-1838%. For both simulation and testing cohorts, the proposed method is more effective, stable, interpretable, and computationally efficient on parameter estimation. The proposed method can be easily extended to a variety of scenarios where decision-making can be done sequentially with only necessary information.
    Machine Learning-enhanced Efficient Spectroscopic Ellipsometry Modeling. (arXiv:2201.04933v1 [cond-mat.mtrl-sci])
    (2 min) Over recent years, there has been extensive adoption of Machine Learning (ML) in a plethora of real-world applications, ranging from computer vision to data mining and drug discovery. In this paper, we utilize ML to facilitate efficient film fabrication, specifically Atomic Layer Deposition (ALD). In order to make advances in ALD process development, which is utilized to generate thin films, and its subsequent accelerated adoption in industry, it is imperative to understand the underlying atomistic processes. Towards this end, in situ techniques for monitoring film growth, such as Spectroscopic Ellipsometry (SE), have been proposed. However, in situ SE is associated with complex hardware and, hence, is resource intensive. To address these challenges, we propose an ML-based approach to expedite film thickness estimation. The proposed approach has significant implications: faster data acquisition, reduced hardware complexity, and easier integration of spectroscopic ellipsometry for in situ monitoring of film thickness deposition. Our experimental results involving SE of TiO2 demonstrate that the proposed ML-based approach furnishes promising thickness prediction accuracy of 88.76% within a +/-1.5 nm interval and 85.14% within a +/-0.5 nm interval. Furthermore, we achieve accuracy of up to 98% at lower thicknesses, which is a significant improvement over existing SE-based analysis, thereby making our solution a viable option for thickness estimation of ultrathin films.
    Functional Anomaly Detection: a Benchmark Study. (arXiv:2201.05115v1 [stat.ML])
    (2 min) The increasing automation in many areas of the Industry expressly demands the design of efficient machine-learning solutions for the detection of abnormal events. With the ubiquitous deployment of sensors monitoring nearly continuously the health of complex infrastructures, anomaly detection can now rely on measurements sampled at a very high frequency, providing a very rich representation of the phenomenon under surveillance. In order to exploit fully the information thus collected, the observations cannot be treated as multivariate data anymore and a functional analysis approach is required. It is the purpose of this paper to investigate the performance of recent techniques for anomaly detection in the functional setup on real datasets. After an overview of the state-of-the-art and a visual-descriptive study, a variety of anomaly detection methods are compared. While taxonomies of abnormalities (e.g. shape, location) in the functional setup are documented in the literature, assigning a specific type to the identified anomalies appears to be a challenging task. Thus, strengths and weaknesses of the existing approaches are benchmarked in view of these highlighted types in a simulation study. Anomaly detection methods are next evaluated on two datasets, namely the monitoring of helicopters in flight and the spectrometry of construction materials. The benchmark analysis concludes with recommendation guidance for practitioners.
    Forecast-based Multi-aspect Framework for Multivariate Time-series Anomaly Detection. (arXiv:2201.04792v1 [cs.LG])
    (2 min) Today's cyber-world is vastly multivariate. Metrics collected in extreme variety demand multivariate algorithms to properly detect anomalies. However, forecast-based algorithms, widely proven approaches, often perform sub-optimally or inconsistently across datasets. A key common issue is that they strive to be one-size-fits-all, but anomalies are distinctive in nature. We propose FMUAD, a Forecast-based, Multi-aspect, Unsupervised Anomaly Detection framework tailored to this distinction. FMUAD explicitly and separately captures the signature traits of anomaly types - spatial change, temporal change and correlation change - with independent modules. The modules then jointly learn an optimal feature representation, which is highly flexible and intuitive, unlike most other models in the category. Extensive experiments show our FMUAD framework consistently outperforms other state-of-the-art forecast-based anomaly detectors.
    On neural network kernels and the storage capacity problem. (arXiv:2201.04669v1 [cond-mat.dis-nn])
    (2 min) In this short note, we reify the connection between work on the storage capacity problem in wide two-layer treelike neural networks and the rapidly-growing body of literature on kernel limits of wide neural networks. Concretely, we observe that the "effective order parameter" studied in the statistical mechanics literature is exactly equivalent to the infinite-width Neural Network Gaussian Process Kernel. This correspondence connects the expressivity and trainability of wide two-layer neural networks.
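    For a ReLU activation the infinite-width NNGP kernel has a well-known closed form (the degree-1 arc-cosine kernel), which makes the correspondence concrete; the snippet below simply evaluates that formula:

      import numpy as np

      def relu_nngp_kernel(x, y):
          # Degree-1 arc-cosine kernel: the NNGP kernel of a wide ReLU layer.
          nx, ny = np.linalg.norm(x), np.linalg.norm(y)
          cos_t = np.clip(x @ y / (nx * ny), -1.0, 1.0)
          theta = np.arccos(cos_t)
          return (nx * ny / (2 * np.pi)) * (np.sin(theta)
                                            + (np.pi - theta) * np.cos(theta))

      rng = np.random.default_rng(0)
      x, y = rng.normal(size=5), rng.normal(size=5)
      print(relu_nngp_kernel(x, y))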
    Reconstructing Training Data with Informed Adversaries. (arXiv:2201.04845v1 [cs.CR])
    (2 min) Given access to a machine learning model, can an adversary reconstruct the model's training data? This work studies this question from the lens of a powerful informed adversary who knows all the training data points except one. By instantiating concrete attacks, we show it is feasible to reconstruct the remaining data point in this stringent threat model. For convex models (e.g. logistic regression), reconstruction attacks are simple and can be derived in closed-form. For more general models (e.g. neural networks), we propose an attack strategy based on training a reconstructor network that receives as input the weights of the model under attack and produces as output the target data point. We demonstrate the effectiveness of our attack on image classifiers trained on MNIST and CIFAR-10, and systematically investigate which factors of standard machine learning pipelines affect reconstruction success. Finally, we theoretically investigate what amount of differential privacy suffices to mitigate reconstruction attacks by informed adversaries. Our work provides an effective reconstruction attack that model developers can use to assess memorization of individual points in general settings beyond those considered in previous works (e.g. generative language models or access to training gradients); it shows that standard models have the capacity to store enough information to enable high-fidelity reconstruction of training data points; and it demonstrates that differential privacy can successfully mitigate such attacks in a parameter regime where utility degradation is minimal.
    Conditional Variational Autoencoder with Balanced Pre-training for Generative Adversarial Networks. (arXiv:2201.04809v1 [cs.CV])
    (2 min) Class imbalance occurs in many real-world applications, including image classification, where the number of images in each class differs significantly. With imbalanced data, generative adversarial networks (GANs) lean toward majority-class samples. Two recent methods, Balancing GAN (BAGAN) and improved BAGAN (BAGAN-GP), were proposed as augmentation tools to handle this problem and restore balance to the data. The former pre-trains the autoencoder weights in an unsupervised manner, but it is unstable when images from different categories have similar features. The latter improves on BAGAN by facilitating supervised autoencoder training, but its pre-training is biased towards the majority classes. In this work, we propose a novel Conditional Variational Autoencoder with Balanced Pre-training for Generative Adversarial Networks (CAPGAN) as an augmentation tool to generate realistic synthetic images. In particular, we utilize a conditional convolutional variational autoencoder with supervised and balanced pre-training for the GAN initialization, and train with gradient penalty. Our proposed method outperforms other state-of-the-art methods on highly imbalanced versions of MNIST, Fashion-MNIST, CIFAR-10, and two medical imaging datasets. Our method can synthesize high-quality minority samples in terms of Fr\'echet inception distance, structural similarity index measure and perceptual quality.
    Real-Time GPU-Accelerated Machine Learning Based Multiuser Detection for 5G and Beyond. (arXiv:2201.05024v1 [eess.SP])
    (2 min) Adaptive partial linear beamforming meets the needs of 5G and future 6G applications for high flexibility and adaptability. The recently proposed multiuser (MU) detection method opens up the choice of an appropriate tradeoff between conflicting goals. Due to their high spatial resolution, nonlinear beamforming filters can significantly outperform linear approaches in stationary scenarios with massive connectivity. However, a dramatic decrease in performance can be expected in high-mobility scenarios because they are very susceptible to changes in the wireless channel. Considering these changes, the robustness of linear filters is required. One way to respond appropriately is to use online machine learning algorithms. The theory of algorithms based on the adaptive projected subgradient method (APSM) is rich, and they promise accurate tracking capabilities in dynamic wireless environments. However, one of the main challenges comes from the real-time implementation of these algorithms, which involve projections onto time-varying closed convex sets. While the projection operations are relatively simple, their vast number poses a challenge in ultralow latency (ULL) applications where latency constraints must be satisfied in every radio frame. Taking non-orthogonal multiple access (NOMA) systems as an example, this paper explores the acceleration of APSM-based algorithms through massive parallelization. The result is a GPU-accelerated real-time implementation of an orthogonal frequency-division multiplexing (OFDM)-based transceiver that enables detection latency of less than one millisecond and therefore complies with the requirements of 5G and beyond. To meet the stringent physical-layer latency requirements, careful co-design of hardware and software is essential, especially in virtualized wireless systems with hardware accelerators.
    Recursive Least Squares for Training and Pruning Convolutional Neural Networks. (arXiv:2201.04813v1 [cs.LG])
    (2 min) Convolutional neural networks (CNNs) have succeeded in many practical applications. However, their high computation and storage requirements often make them difficult to deploy on resource-constrained devices. In order to tackle this issue, many pruning algorithms have been proposed for CNNs, but most of them cannot prune CNNs to a reasonable level. In this paper, we propose a novel algorithm for training and pruning CNNs based on recursive least squares (RLS) optimization. After training a CNN for some epochs, our algorithm combines inverse input autocorrelation matrices and weight matrices to evaluate and prune unimportant input channels or nodes layer by layer. Then, our algorithm continues to train the pruned network, and does not perform the next pruning until the pruned network recovers the full performance of the old network. Besides CNNs, the proposed algorithm can be used for feedforward neural networks (FNNs). Three experiments on the MNIST, CIFAR-10 and SVHN datasets show that our algorithm achieves more reasonable pruning and higher learning efficiency than four other popular pruning algorithms.
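    The optimizer underlying the method is standard recursive least squares; a textbook single-output sketch (not the paper's CNN variant) looks like this:

      import numpy as np

      def rls_step(w, P, x, d, lam=0.99):
          # w: weights, P: inverse input autocorrelation estimate,
          # x: input vector, d: desired output, lam: forgetting factor.
          Px = P @ x
          k = Px / (lam + x @ Px)      # gain vector
          e = d - w @ x                # a-priori error
          w = w + k * e
          P = (P - np.outer(k, Px)) / lam
          return w, P

      rng = np.random.default_rng(0)
      true_w = np.array([0.5, -1.0, 2.0])
      w, P = np.zeros(3), np.eye(3) * 100.0
      for _ in range(500):
          x = rng.normal(size=3)
          d = true_w @ x + rng.normal(0, 0.01)
          w, P = rls_step(w, P, x, d)
      print(w)  # converges near true_w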
  • stat.ML updates on arXiv.org

    Multi-task longitudinal forecasting with missing values on Alzheimer's Disease. (arXiv:2201.05040v1 [stat.ML])
    (2 min) Machine learning techniques typically applied to dementia forecasting lack the ability to jointly learn several tasks, handle time-dependent heterogeneous data, and deal with missing values. In this paper, we propose a framework, using the recently presented SSHIBA model, for jointly learning different tasks on longitudinal data with missing values. The method uses Bayesian variational inference to impute missing values and combine information from several views. This way, we can combine different data views from different time-points in a common latent space and learn the relations between each time-point, while simultaneously modelling and predicting several output variables. We apply this model to jointly predict diagnosis, ventricle volume, and clinical scores in dementia. The results demonstrate that SSHIBA is capable of learning a good imputation of the missing values and outperforms the baselines while simultaneously predicting three different tasks.
    Functional Anomaly Detection: a Benchmark Study. (arXiv:2201.05115v1 [stat.ML])
    (0 min) The increasing automation in many areas of the Industry expressly demands the design of efficient machine-learning solutions for the detection of abnormal events. With the ubiquitous deployment of sensors monitoring nearly continuously the health of complex infrastructures, anomaly detection can now rely on measurements sampled at a very high frequency, providing a very rich representation of the phenomenon under surveillance. In order to exploit fully the information thus collected, the observations cannot be treated as multivariate data anymore and a functional analysis approach is required. It is the purpose of this paper to investigate the performance of recent techniques for anomaly detection in the functional setup on real datasets. After an overview of the state-of-the-art and a visual-descriptive study, a variety of anomaly detection methods are compared. While taxonomies of abnormalities (e.g. shape, location) in the functional setup are documented in the literature, assigning a specific type to the identified anomalies appears to be a challenging task. Thus, strengths and weaknesses of the existing approaches are benchmarked in view of these highlighted types in a simulation study. Anomaly detection methods are next evaluated on two datasets, namely the monitoring of helicopters in flight and the spectrometry of construction materials. The benchmark analysis concludes with recommendation guidance for practitioners.
    Partial Recovery in the Graph Alignment Problem. (arXiv:2007.00533v5 [stat.ML] UPDATED)
    (0 min) In this paper, we consider the graph alignment problem, which is the problem of recovering, given two graphs, a one-to-one mapping between nodes that maximizes edge overlap. This problem can be viewed as a noisy version of the well-known graph isomorphism problem and appears in many applications, including social network deanonymization and cellular biology. Our focus here is on partial recovery, i.e., we look for a one-to-one mapping which is correct on a fraction of the nodes of the graph rather than on all of them, and we assume that the two input graphs to the problem are correlated Erd\H{o}s-R\'enyi graphs of parameters $(n,q,s)$. Our main contribution is then to give necessary and sufficient conditions on $(n,q,s)$ under which partial recovery is possible with high probability as the number of nodes $n$ goes to infinity. In particular, we show that it is possible to achieve partial recovery in the $nqs=\Theta(1)$ regime under certain additional assumptions.
    DS-Sync: Addressing Network Bottlenecks with Divide-and-Shuffle Synchronization for Distributed DNN Training. (arXiv:2007.03298v2 [cs.LG] UPDATED)
    (0 min) Bulk synchronous parallel (BSP) is the de-facto paradigm for distributed DNN training in today's production clusters. However, due to the global synchronization nature, its performance can be significantly influenced by network bottlenecks caused by either static topology heterogeneity or dynamic bandwidth contentions. Existing solutions, either system-level optimizations strengthening BSP (e.g., Ring or Hierarchical All-reduce) or algorithmic optimizations replacing BSP (e.g., ASP or SSP, which relax the global barriers), do not completely solve the problem, as they may still suffer from communication inefficiency or risk convergence inaccuracy. In this paper, we present a novel divide-and-shuffle synchronization (DS-Sync) to realize communication efficiency without sacrificing convergence accuracy for distributed DNN training. At its heart, by taking into account the network bottlenecks, DS-Sync improves communication efficiency by dividing workers into non-overlap groups to synchronize independently in a bottleneck-free manner. Meanwhile, it maintains convergence accuracy by iteratively shuffling workers among different groups to ensure a global consensus. We theoretically prove that DS-Sync converges properly in non-convex and smooth conditions like DNN. We further implement DS-Sync and integrate it with PyTorch, and our testbed experiments show that DS-Sync can achieve up to $94\%$ improvements on the end-to-end training time with existing solutions while maintaining the same accuracy.
    On component interactions in two-stage recommender systems. (arXiv:2106.14979v3 [cs.IR] UPDATED)
    (0 min) Thanks to their scalability, two-stage recommenders are used by many of today's largest online platforms, including YouTube, LinkedIn, and Pinterest. These systems produce recommendations in two steps: (i) multiple nominators, tuned for low prediction latency, preselect a small subset of candidates from the whole item pool; (ii) a slower but more accurate ranker further narrows down the nominated items and serves them to the user. Despite their popularity, the literature on two-stage recommenders is relatively scarce, and the algorithms are often treated as mere sums of their parts. Such treatment presupposes that the two-stage performance is explained by the behavior of the individual components in isolation. This is not the case: using synthetic and real-world data, we demonstrate that interactions between the ranker and the nominators substantially affect the overall performance. Motivated by these findings, we derive a generalization lower bound which shows that independent nominator training can lead to performance on par with uniformly random recommendations. We find that careful design of item pools, each assigned to a different nominator, alleviates these issues. As manual search for a good pool allocation is difficult, we propose to learn one instead using a Mixture-of-Experts based approach. This significantly improves both precision and recall at K.
    Hyperparameter Importance for Machine Learning Algorithms. (arXiv:2201.05132v1 [stat.ML])
    (0 min) Hyperparameters play an essential role in the fitting of supervised machine learning algorithms. However, it is computationally expensive to tune all the tunable hyperparameters simultaneously, especially for large data sets. In this paper, we give a definition of hyperparameter importance that can be estimated by subsampling procedures. According to this importance, hyperparameters can then be tuned on the entire data set more efficiently. We show theoretically that the proposed importance on subsets of data is consistent with the one on the population data under weak conditions. Numerical experiments show that the proposed importance is consistent and can save substantial computational resources.
    Online 3D Bin Packing with Constrained Deep Reinforcement Learning. (arXiv:2006.14978v5 [cs.LG] UPDATED)
    (0 min) We solve a challenging yet practically useful variant of the 3D Bin Packing Problem (3D-BPP). In our problem, the agent has limited information about the items to be packed into the bin, and an item must be packed immediately after its arrival without buffering or readjusting. The item's placement is also subject to the constraints of collision avoidance and physical stability. We formulate this online 3D-BPP as a constrained Markov decision process. To solve the problem, we propose an effective and easy-to-implement constrained deep reinforcement learning (DRL) method under the actor-critic framework. In particular, we introduce a feasibility predictor to predict the feasibility mask for the placement actions and use it to modulate the action probabilities output by the actor during training. Such supervision and transformation enable the agent to learn feasible policies efficiently. Our method can also be generalized, e.g., to handle lookahead or items with different orientations. We have conducted extensive evaluation showing that the learned policy significantly outperforms the state-of-the-art methods. A user study suggests that our method attains human-level performance.
    Planning in Observable POMDPs in Quasipolynomial Time. (arXiv:2201.04735v1 [cs.LG])
    (0 min) Partially Observable Markov Decision Processes (POMDPs) are a natural and general model in reinforcement learning that take into account the agent's uncertainty about its current state. In the literature on POMDPs, it is customary to assume access to a planning oracle that computes an optimal policy when the parameters are known, even though the problem is known to be computationally hard. Almost all existing planning algorithms either run in exponential time, lack provable performance guarantees, or require placing strong assumptions on the transition dynamics under every possible policy. In this work, we revisit the planning problem and ask: are there natural and well-motivated assumptions that make planning easy? Our main result is a quasipolynomial-time algorithm for planning in (one-step) observable POMDPs. Specifically, we assume that well-separated distributions on states lead to well-separated distributions on observations, and thus the observations are at least somewhat informative in each step. Crucially, this assumption places no restrictions on the transition dynamics of the POMDP; nevertheless, it implies that near-optimal policies admit quasi-succinct descriptions, which is not true in general (under standard hardness assumptions). Our analysis is based on new quantitative bounds for filter stability -- i.e. the rate at which an optimal filter for the latent state forgets its initialization. Furthermore, we prove matching hardness for planning in observable POMDPs under the Exponential Time Hypothesis.
    Continuous Herded Gibbs Sampling. (arXiv:2106.06430v2 [stat.ML] UPDATED)
    (0 min) Herding is a technique to sequentially generate deterministic samples from a probability distribution. In this work, we propose a continuous herded Gibbs sampler that combines kernel herding on continuous densities with the Gibbs sampling idea. Our algorithm allows for deterministically sampling from high-dimensional multivariate probability densities, without directly sampling from the joint density. Experiments with Gaussian mixture densities indicate that the L2 error decreases similarly to kernel herding, while the computation time is significantly lower, i.e., linear in the number of dimensions.
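    The kernel herding half of the combination can be sketched in one dimension: each new sample maximizes the target's kernel mean embedding minus the embedding of the samples chosen so far (the target mixture, RBF bandwidth, and grid search are all illustrative stand-ins):

      import numpy as np

      h = 0.3                                      # RBF kernel bandwidth
      comps = [(-2.0, 0.5, 0.5), (2.0, 0.5, 0.5)]  # (mean, std, weight)
      grid = np.linspace(-5, 5, 2001)

      def mean_embedding(x):
          # E_{y~p}[k(x, y)] in closed form for an RBF kernel and Gaussian p.
          total = np.zeros_like(x)
          for mu, s, w in comps:
              total += w * (h / np.sqrt(h**2 + s**2)) * np.exp(
                  -(x - mu) ** 2 / (2 * (h**2 + s**2)))
          return total

      samples = []
      for t in range(50):
          score = mean_embedding(grid)
          if samples:
              xs = np.array(samples)
              score -= np.exp(-(grid[:, None] - xs) ** 2
                              / (2 * h**2)).mean(axis=1)
          samples.append(grid[np.argmax(score)])
      print(np.round(samples[:10], 2))  # deterministic, alternates modes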
    The curse of overparametrization in adversarial training: Precise analysis of robust generalization for random features regression. (arXiv:2201.05149v1 [cs.LG])
    (0 min) Successful deep learning models often involve training neural network architectures that contain more parameters than the number of training samples. Such overparametrized models have been extensively studied in recent years, and the virtues of overparametrization have been established from both the statistical perspective, via the double-descent phenomenon, and the computational perspective via the structural properties of the optimization landscape. Despite the remarkable success of deep learning architectures in the overparametrized regime, it is also well known that these models are highly vulnerable to small adversarial perturbations in their inputs. Even when adversarially trained, their performance on perturbed inputs (robust generalization) is considerably worse than their best attainable performance on benign inputs (standard generalization). It is thus imperative to understand how overparametrization fundamentally affects robustness. In this paper, we will provide a precise characterization of the role of overparametrization on robustness by focusing on random features regression models (two-layer neural networks with random first layer weights). We consider a regime where the sample size, the input dimension and the number of parameters grow in proportion to each other, and derive an asymptotically exact formula for the robust generalization error when the model is adversarially trained. Our developed theory reveals the nontrivial effect of overparametrization on robustness and indicates that for adversarially trained random features models, high overparametrization can hurt robust generalization.
    A Comparative Study on Basic Elements of Deep Learning Models for Spatial-Temporal Traffic Forecasting. (arXiv:2111.07513v2 [cs.LG] UPDATED)
    (0 min) Traffic forecasting plays a crucial role in intelligent transportation systems. The spatial-temporal complexities in transportation networks make the problem especially challenging. The recently suggested deep learning models share basic elements such as graph convolution, graph attention, recurrent units, and/or attention mechanism. In this study, we designed an in-depth comparative study for four deep neural network models utilizing different basic elements. For base models, one RNN-based model and one attention-based model were chosen from previous literature. Then, the spatial feature extraction layers in the models were substituted with graph convolution and graph attention. To analyze the performance of each element in various environments, we conducted experiments on four real-world datasets - highway speed, highway flow, urban speed from a homogeneous road link network, and urban speed from a heterogeneous road link network. The results demonstrate that the RNN-based model and the attention-based model show a similar level of performance for short-term prediction, and the attention-based model outperforms the RNN in longer-term predictions. The choice of graph convolution and graph attention makes a larger difference in the RNN-based models. Also, our modified version of GMAN shows comparable performance with the original with less memory consumption.
    Pointwise Binary Classification with Pairwise Confidence Comparisons. (arXiv:2010.01875v4 [cs.LG] UPDATED)
    (0 min) To alleviate the data requirement for training effective binary classifiers, many weakly supervised learning settings have been proposed. Among them, some consider using pairwise rather than pointwise labels, when pointwise labels are not accessible due to privacy, confidentiality, or security reasons. However, as a pairwise label denotes whether or not two data points share a pointwise label, it cannot be easily collected if either point is equally likely to be positive or negative. Thus, in this paper, we propose a novel setting called pairwise comparison (Pcomp) classification, where we have only pairs of unlabeled data for which we know one is more likely to be positive than the other. Firstly, we give a Pcomp data generation process, derive an unbiased risk estimator (URE) with a theoretical guarantee, and further improve the URE using correction functions. Secondly, we link Pcomp classification to noisy-label learning to develop a progressive URE and improve it by imposing consistency regularization. Finally, we demonstrate by experiments the effectiveness of our methods, which suggests Pcomp is a valuable and practically useful type of pairwise supervision besides the pairwise label.
    A Non-Classical Parameterization for Density Estimation Using Sample Moments. (arXiv:2201.04786v1 [stat.ML])
    (0 min) Moment methods are an important means of density estimation, but they generally depend strongly on the choice of feasible functions, which severely affects performance. We propose a non-classical parameterization for density estimation using sample moments, which does not require the choice of such functions. The parameterization is induced by the Kullback-Leibler distance, and its solution, which is proved to exist and to be unique subject to a simple prior that does not depend on the data, can be obtained by convex optimization. Simulation results show the performance of the proposed estimator in estimating multi-modal densities which are mixtures of different types of functions.
    Hyperparameter Selection for Subsampling Bootstraps. (arXiv:2006.01786v2 [stat.ME] UPDATED)
    (0 min) As massive data analysis becomes increasingly prevalent, subsampling methods like BLB (Bag of Little Bootstraps) serve as powerful tools for assessing the quality of estimators for massive data. However, the performance of subsampling methods is highly influenced by the selection of tuning parameters (e.g., the subset size and the number of resamples per subset). In this article we develop a hyperparameter selection methodology that can be used to select tuning parameters for subsampling methods. Specifically, by careful theoretical analysis, we find an analytically simple and elegant relationship between the asymptotic efficiency of various subsampling estimators and their hyperparameters. This leads to an optimal choice of the hyperparameters. More specifically, for an arbitrarily specified hyperparameter set, we can improve it to a new set of hyperparameters with no extra CPU time cost, such that the resulting estimator's statistical efficiency is much improved. Both simulation studies and real data analysis demonstrate the superior advantage of our method.
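    A compact BLB sketch makes the tuning parameters in question explicit: the subset-size exponent, the number of subsets, and the number of resamples per subset (the defaults below are arbitrary, not recommendations from the paper):

      import numpy as np

      def blb_stderr(data, estimator, gamma=0.7, n_subsets=5,
                     n_resamples=50, rng=None):
          # Estimate the standard error of `estimator` on `data` via BLB.
          rng = np.random.default_rng(rng)
          n = len(data)
          b = int(n ** gamma)                  # little-bootstrap subset size
          ses = []
          for _ in range(n_subsets):
              subset = rng.choice(data, size=b, replace=False)
              stats = []
              for _ in range(n_resamples):
                  # Resample n points from the b-point subset via multinomial
                  # counts, so each resample has the full-data scale.
                  counts = rng.multinomial(n, np.full(b, 1.0 / b))
                  stats.append(estimator(np.repeat(subset, counts)))
              ses.append(np.std(stats, ddof=1))
          return float(np.mean(ses))           # average across subsets

      data = np.random.default_rng(0).normal(0, 1, 10_000)
      print(blb_stderr(data, np.mean))  # about 1/sqrt(10_000) = 0.01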
    Global Optimality Beyond Two Layers: Training Deep ReLU Networks via Convex Programs. (arXiv:2110.05518v2 [cs.LG] UPDATED)
    (0 min) Understanding the fundamental mechanism behind the success of deep neural networks is one of the key challenges in the modern machine learning literature. Despite numerous attempts, a solid theoretical analysis is yet to be developed. In this paper, we develop a novel unified framework to reveal a hidden regularization mechanism through the lens of convex optimization. We first show that the training of multiple three-layer ReLU sub-networks with weight decay regularization can be equivalently cast as a convex optimization problem in a higher dimensional space, where sparsity is enforced via a group $\ell_1$-norm regularization. Consequently, ReLU networks can be interpreted as high dimensional feature selection methods. More importantly, we then prove that the equivalent convex problem can be globally optimized by a standard convex optimization solver with a polynomial-time complexity with respect to the number of samples and data dimension when the width of the network is fixed. Finally, we numerically validate our theoretical results via experiments involving both synthetic and real datasets.
    Unlabeled Data Improves Adversarial Robustness. (arXiv:1905.13736v4 [stat.ML] UPDATED)
    (0 min) We demonstrate, theoretically and empirically, that adversarial robustness can significantly benefit from semisupervised learning. Theoretically, we revisit the simple Gaussian model of Schmidt et al. that shows a sample complexity gap between standard and robust classification. We prove that unlabeled data bridges this gap: a simple semisupervised learning procedure (self-training) achieves high robust accuracy using the same number of labels required for achieving high standard accuracy. Empirically, we augment CIFAR-10 with 500K unlabeled images sourced from 80 Million Tiny Images and use robust self-training to outperform state-of-the-art robust accuracies by over 5 points in (i) $\ell_\infty$ robustness against several strong attacks via adversarial training and (ii) certified $\ell_2$ and $\ell_\infty$ robustness via randomized smoothing. On SVHN, adding the dataset's own extra training set with the labels removed provides gains of 4 to 10 points, within 1 point of the gain from using the extra labels.
    Active Learning-Based Multistage Sequential Decision-Making Model with Application on Common Bile Duct Stone Evaluation. (arXiv:2201.04807v1 [stat.ML])
    (0 min) Multistage sequential decision-making scenarios are commonly seen in the healthcare diagnosis process. In this paper, an active learning-based method is developed to actively collect only the necessary patient data in a sequential manner. There are two novelties in the proposed method. First, unlike the existing ordinal logistic regression model which only models a single stage, we estimate the parameters for all stages together. Second, it is assumed that the coefficients for common features in different stages are kept consistent. The effectiveness of the proposed method is validated in both a simulation study and a real case study. Compared with the baseline method where the data is modeled individually and independently, the proposed method improves the estimation efficiency by 62%-1838%. For both simulation and testing cohorts, the proposed method is more effective, stable, interpretable, and computationally efficient on parameter estimation. The proposed method can be easily extended to a variety of scenarios where decision-making can be done sequentially with only necessary information.
    Generalized Kernel Ridge Regression for Long Term Causal Inference: Treatment Effects, Dose Responses, and Counterfactual Distributions. (arXiv:2201.05139v1 [econ.EM])
    (0 min) I propose kernel ridge regression estimators for long term causal inference, where a short term experimental data set containing randomized treatment and short term surrogates is fused with a long term observational data set containing short term surrogates and long term outcomes. I propose estimators of treatment effects, dose responses, and counterfactual distributions with closed form solutions in terms of kernel matrix operations. I allow covariates, treatment, and surrogates to be discrete or continuous, and low, high, or infinite dimensional. For long term treatment effects, I prove $\sqrt{n}$ consistency, Gaussian approximation, and semiparametric efficiency. For long term dose responses, I prove uniform consistency with finite sample rates. For long term counterfactual distributions, I prove convergence in distribution.
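    The basic building block is plain kernel ridge regression, whose closed-form solution is a few lines of linear algebra (the RBF kernel and regularization strength below are illustrative; this is not the paper's generalized long-term estimator):

      import numpy as np

      def rbf(A, B, h=1.0):
          d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
          return np.exp(-d2 / (2 * h**2))

      def krr_fit_predict(X, y, X_new, lam=1e-2):
          # alpha = (K + lam*n*I)^{-1} y; predictions are K(new, X) @ alpha.
          K = rbf(X, X)
          alpha = np.linalg.solve(K + lam * len(X) * np.eye(len(X)), y)
          return rbf(X_new, X) @ alpha

      rng = np.random.default_rng(0)
      X = rng.uniform(-3, 3, size=(200, 1))
      y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)
      print(krr_fit_predict(X, y, np.array([[0.0], [1.5]])))  # ~ sin values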
    Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet?. (arXiv:2201.05119v1 [cs.CV])
    (0 min) Despite recent progress made by self-supervised methods in representation learning with residual networks, they still underperform supervised learning on the ImageNet classification benchmark, limiting their applicability in performance-critical settings. Building on prior theoretical insights from Mitrovic et al., 2021, we propose ReLICv2 which combines an explicit invariance loss with a contrastive objective over a varied set of appropriately constructed data views. ReLICv2 achieves 77.1% top-1 classification accuracy on ImageNet using linear evaluation with a ResNet50 architecture and 80.6% with larger ResNet models, outperforming previous state-of-the-art self-supervised approaches by a wide margin. Most notably, ReLICv2 is the first representation learning method to consistently outperform the supervised baseline in a like-for-like comparison using a range of standard ResNet architectures. Finally we show that despite using ResNet encoders, ReLICv2 is comparable to state-of-the-art self-supervised vision transformers.
    Spatiotemporal Clustering with Neyman-Scott Processes via Connections to Bayesian Nonparametric Mixture Models. (arXiv:2201.05044v1 [stat.ML])
    (0 min) Neyman-Scott processes (NSPs) are point process models that generate clusters of points in time or space. They are natural models for a wide range of phenomena, ranging from neural spike trains to document streams. The clustering property is achieved via a doubly stochastic formulation: first, a set of latent events is drawn from a Poisson process; then, each latent event generates a set of observed data points according to another Poisson process. This construction is similar to Bayesian nonparametric mixture models like the Dirichlet process mixture model (DPMM) in that the number of latent events (i.e. clusters) is a random variable, but the point process formulation makes the NSP especially well suited to modeling spatiotemporal data. While many specialized algorithms have been developed for DPMMs, comparatively fewer works have focused on inference in NSPs. Here, we present novel connections between NSPs and DPMMs, with the key link being a third class of Bayesian mixture models called mixture of finite mixture models (MFMMs). Leveraging this connection, we adapt the standard collapsed Gibbs sampling algorithm for DPMMs to enable scalable Bayesian inference on NSP models. We demonstrate the potential of Neyman-Scott processes on a variety of applications including sequence detection in neural spike trains and event detection in document streams.
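    The doubly stochastic construction translates directly into a sampler; a 1-D sketch with illustrative rates follows:

      import numpy as np

      def sample_nsp_1d(rate_latent, mean_offspring, offspring_std,
                        t_max, rng=None):
          # Latent events from a homogeneous Poisson process on [0, t_max],
          # then a Poisson number of Gaussian-placed observations per event.
          rng = np.random.default_rng(rng)
          n_latent = rng.poisson(rate_latent * t_max)
          latents = rng.uniform(0, t_max, n_latent)
          points = [rng.normal(c, offspring_std, rng.poisson(mean_offspring))
                    for c in latents]
          obs = np.sort(np.concatenate(points)) if points else np.array([])
          return latents, obs

      latents, obs = sample_nsp_1d(rate_latent=0.5, mean_offspring=10,
                                   offspring_std=0.2, t_max=20.0, rng=0)
      print(len(latents), "clusters,", len(obs), "observed points")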
    How Tight Can PAC-Bayes be in the Small Data Regime?. (arXiv:2106.03542v4 [stat.ML] UPDATED)
    (2 min) In this paper, we investigate the question: Given a small number of datapoints, for example N = 30, how tight can PAC-Bayes and test set bounds be made? For such small datasets, test set bounds adversely affect generalisation performance by withholding data from the training procedure. In this setting, PAC-Bayes bounds are especially attractive, due to their ability to use all the data to simultaneously learn a posterior and bound its generalisation risk. We focus on the case of i.i.d. data with a bounded loss and consider the generic PAC-Bayes theorem of Germain et al. While their theorem is known to recover many existing PAC-Bayes bounds, it is unclear what the tightest bound derivable from their framework is. For a fixed learning algorithm and dataset, we show that the tightest possible bound coincides with a bound considered by Catoni; and, in the more natural case of distributions over datasets, we establish a lower bound on the best bound achievable in expectation. Interestingly, this lower bound recovers the Chernoff test set bound if the posterior is equal to the prior. Moreover, to illustrate how tight these bounds can be, we study synthetic one-dimensional classification tasks in which it is feasible to meta-learn both the prior and the form of the bound to numerically optimise for the tightest bounds possible. We find that in this simple, controlled scenario, PAC-Bayes bounds are competitive with comparable, commonly used Chernoff test set bounds. However, the sharpest test set bounds still lead to better guarantees on the generalisation error than the PAC-Bayes bounds we consider.
    Implicit Bias of MSE Gradient Optimization in Underparameterized Neural Networks. (arXiv:2201.04738v1 [stat.ML])
    (2 min) We study the dynamics of a neural network in function space when optimizing the mean squared error via gradient flow. We show that in the underparameterized regime the network learns eigenfunctions of an integral operator $T_{K^\infty}$ determined by the Neural Tangent Kernel (NTK) at rates corresponding to their eigenvalues. For example, for uniformly distributed data on the sphere $S^{d - 1}$ and rotation invariant weight distributions, the eigenfunctions of $T_{K^\infty}$ are the spherical harmonics. Our results can be understood as describing a spectral bias in the underparameterized regime. The proofs use the concept of "Damped Deviations", where deviations of the NTK matter less for eigendirections with large eigenvalues due to the occurrence of a damping factor. Aside from the underparameterized regime, the damped deviations point of view can be used to track the dynamics of the empirical risk in the overparameterized setting, allowing us to extend certain results in the literature. We conclude that damped deviations offer a simple and unifying perspective on the dynamics when optimizing the squared error.
    A robust kernel machine regression towards biomarker selection in multi-omics datasets of osteoporosis for drug discovery. (arXiv:2201.05060v1 [stat.ML])
    (2 min) Many statistical machine learning approaches could ultimately highlight novel features of the etiology of complex diseases by analyzing multi-omics data. However, they are sensitive to deviations in distribution when the observed samples are potentially contaminated with adversarially corrupted outliers (e.g., a fictional data distribution). Likewise, statistical advances lag in supporting comprehensive data-driven analyses of complex multi-omics data integration. We propose a novel non-linear M-estimator-based approach, "robust kernel machine regression (RobKMR)," to improve the robustness of statistical machine regression and the diversity of fictional data to examine the higher-order composite effect of multi-omics datasets. We address a robust kernel-centered Gram matrix to estimate the model parameters accurately. We also propose a robust score test to assess the marginal and joint Hadamard product of features from multi-omics data. We apply our proposed approach to a multi-omics dataset of osteoporosis (OP) from Caucasian females. Experiments demonstrate that the proposed approach effectively identifies the inter-related risk factors of OP. With solid evidence (p-value = 0.00001), biological validations, network-based analysis, causal inference, and drug repurposing, the three selected triplets ((DKK1, SMTN, DRGX), (MTND5, FASTKD2, CSMD3), (MTND5, COG3, CSMD3)) are significant biomarkers and directly relate to bone mineral density (BMD). Overall, the top three selected genes (DKK1, MTND5, FASTKD2) and one gene (SIDT1, at p-value = 0.001) significantly bond with four drugs (Tacrolimus, Ibandronate, Alendronate, and Bazedoxifene) out of 30 candidates for drug repurposing in OP. Further, the proposed approach can be applied to any disease model where multi-omics datasets are available.
    Unifying Epidemic Models with Mixtures. (arXiv:2201.04960v1 [stat.ML])
    (2 min) The COVID-19 pandemic has emphasized the need for a robust understanding of epidemic models. Current models of epidemics are classified as either mechanistic or non-mechanistic: mechanistic models make explicit assumptions on the dynamics of disease, whereas non-mechanistic models make assumptions on the form of observed time series. Here, we introduce a simple mixture-based model which bridges the two approaches while retaining benefits of both. The model represents time series of cases and fatalities as a mixture of Gaussian curves, providing a flexible function class to learn from data compared to traditional mechanistic models. Although the model is non-mechanistic, we show that it arises as the natural outcome of a stochastic process based on a networked SIR framework. This allows learned parameters to take on a more meaningful interpretation compared to similar non-mechanistic models, and we validate the interpretations using auxiliary mobility data collected during the COVID-19 pandemic. We provide a simple learning algorithm to identify model parameters and establish theoretical results which show the model can be efficiently learned from data. Empirically, we find the model to have low prediction error. The model is available live at covidpredictions.mit.edu. Ultimately, this allows us to systematically understand the impacts of interventions on COVID-19, which is critical in developing data-driven solutions to controlling epidemics.
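    To make the function class concrete, a hedged sketch (synthetic data and made-up parameter values, not the authors' code) of fitting a mixture of two Gaussian curves to a daily-case series:

        import numpy as np
        from scipy.optimize import curve_fit

        def two_waves(t, a1, m1, s1, a2, m2, s2):
            """Daily cases modelled as a mixture of two Gaussian curves."""
            return (a1 * np.exp(-((t - m1) ** 2) / (2 * s1 ** 2))
                    + a2 * np.exp(-((t - m2) ** 2) / (2 * s2 ** 2)))

        t = np.arange(200, dtype=float)
        truth = two_waves(t, 1000, 50, 12, 1800, 140, 20)
        cases = np.random.default_rng(0).poisson(truth)  # synthetic observed counts

        p0 = [800, 40, 10, 1500, 150, 15]  # rough initial guess for the optimizer
        params, _ = curve_fit(two_waves, t, cases, p0=p0)
        print(np.round(params, 1))  # recovered amplitudes, peak times, widths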
    Generalized Shape Metrics on Neural Representations. (arXiv:2110.14739v2 [stat.ML] UPDATED)
    (0 min) Understanding the operation of biological and artificial networks remains a difficult and important challenge. To identify general principles, researchers are increasingly interested in surveying large collections of networks that are trained on, or biologically adapted to, similar tasks. A standardized set of analysis tools is now needed to identify how network-level covariates -- such as architecture, anatomical brain region, and model organism -- impact neural representations (hidden layer activations). Here, we provide a rigorous foundation for these analyses by defining a broad family of metric spaces that quantify representational dissimilarity. Using this framework we modify existing representational similarity measures based on canonical correlation analysis to satisfy the triangle inequality, formulate a novel metric that respects the inductive biases in convolutional layers, and identify approximate Euclidean embeddings that enable network representations to be incorporated into essentially any off-the-shelf machine learning method. We demonstrate these methods on large-scale datasets from biology (Allen Institute Brain Observatory) and deep learning (NAS-Bench-101). In doing so, we identify relationships between neural representations that are interpretable in terms of anatomical features and model performance.
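    One simple member of such a family is easy to sketch: after centering and norm-scaling two activation matrices of the same shape, the orthogonal Procrustes distance (the minimum of ||X - YQ||_F over orthogonal Q) satisfies the triangle inequality. A minimal sketch (an editor's illustration, not the paper's released toolbox):

        import numpy as np
        from scipy.linalg import orthogonal_procrustes

        def shape_distance(X, Y):
            """Procrustes distance between two (samples x units) representations."""
            X = X - X.mean(axis=0); X = X / np.linalg.norm(X)
            Y = Y - Y.mean(axis=0); Y = Y / np.linalg.norm(Y)
            Q, _ = orthogonal_procrustes(Y, X)  # best rotation aligning Y to X
            return np.linalg.norm(X - Y @ Q)

        rng = np.random.default_rng(0)
        A = rng.normal(size=(100, 20))
        B = rng.normal(size=(100, 20))
        print(shape_distance(A, B), shape_distance(A, A))  # the latter is ~0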
    Symmetric Sparse Boolean Matrix Factorization and Applications. (arXiv:2102.01570v3 [cs.LG] UPDATED)
    (0 min) In this work, we study a variant of nonnegative matrix factorization where we wish to find a symmetric factorization of a given input matrix into a sparse, Boolean matrix. Formally speaking, given $\mathbf{M}\in\mathbb{Z}^{m\times m}$, we want to find $\mathbf{W}\in\{0,1\}^{m\times r}$ such that $\| \mathbf{M} - \mathbf{W}\mathbf{W}^\top \|_0$ is minimized among all $\mathbf{W}$ for which each row is $k$-sparse. This question turns out to be closely related to a number of questions like recovering a hypergraph from its line graph, as well as reconstruction attacks for private neural network training. As this problem is hard in the worst-case, we study a natural average-case variant that arises in the context of these reconstruction attacks: $\mathbf{M} = \mathbf{W}\mathbf{W}^{\top}$ for $\mathbf{W}$ a random Boolean matrix with $k$-sparse rows, and the goal is to recover $\mathbf{W}$ up to column permutation. Equivalently, this can be thought of as recovering a uniformly random $k$-uniform hypergraph from its line graph. Our main result is a polynomial-time algorithm for this problem based on bootstrapping higher-order information about $\mathbf{W}$ and then decomposing an appropriate tensor. The key ingredient in our analysis, which may be of independent interest, is to show that such a matrix $\mathbf{W}$ has full column rank with high probability as soon as $m = \widetilde{\Omega}(r)$, which we do using tools from Littlewood-Offord theory and estimates for binary Krawtchouk polynomials.
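    The average-case setup is easy to reproduce. A small sketch (sizes here are arbitrary) that draws a random Boolean W with k-sparse rows, forms M = W W^T, and checks the full-column-rank property the analysis hinges on:

        import numpy as np

        rng = np.random.default_rng(0)
        m, r, k = 200, 40, 3

        # each row of W picks k of the r columns uniformly at random
        W = np.zeros((m, r), dtype=int)
        for i in range(m):
            W[i, rng.choice(r, size=k, replace=False)] = 1

        M = W @ W.T                      # the observed input (line-graph counts)
        print(np.linalg.matrix_rank(W))  # full column rank (= r) w.h.p. once m >> r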
    Real-Time GPU-Accelerated Machine Learning Based Multiuser Detection for 5G and Beyond. (arXiv:2201.05024v1 [eess.SP])
    (2 min) Adaptive partial linear beamforming meets the need of 5G and future 6G applications for high flexibility and adaptability. The recently proposed multiuser (MU) detection method opens up the choice of an appropriate tradeoff between conflicting goals. Due to their high spatial resolution, nonlinear beamforming filters can significantly outperform linear approaches in stationary scenarios with massive connectivity. However, a dramatic decrease in performance can be expected in high-mobility scenarios because they are very susceptible to changes in the wireless channel. In view of these changes, the robustness of linear filters is required. One way to respond appropriately is to use online machine learning algorithms. The theory of algorithms based on the adaptive projected subgradient method (APSM) is rich, and they promise accurate tracking capabilities in dynamic wireless environments. However, one of the main challenges comes from the real-time implementation of these algorithms, which involve projections onto time-varying closed convex sets. While the projection operations are relatively simple, their vast number poses a challenge in ultralow-latency (ULL) applications where latency constraints must be satisfied in every radio frame. Taking non-orthogonal multiple access (NOMA) systems as an example, this paper explores the acceleration of APSM-based algorithms through massive parallelization. The result is a GPU-accelerated real-time implementation of an orthogonal frequency-division multiplexing (OFDM)-based transceiver that enables detection latency of less than one millisecond and therefore complies with the requirements of 5G and beyond. To meet the stringent physical-layer latency requirements, careful co-design of hardware and software is essential, especially in virtualized wireless systems with hardware accelerators.
    Metric-Free Individual Fairness in Online Learning. (arXiv:2002.05474v5 [cs.LG] UPDATED)
    (0 min) We study an online learning problem subject to the constraint of individual fairness, which requires that similar individuals are treated similarly. Unlike prior work on individual fairness, we do not assume the similarity measure among individuals is known, nor do we assume that such measure takes a certain parametric form. Instead, we leverage the existence of an auditor who detects fairness violations without enunciating the quantitative measure. In each round, the auditor examines the learner's decisions and attempts to identify a pair of individuals that are treated unfairly by the learner. We provide a general reduction framework that reduces online classification in our model to standard online classification, which allows us to leverage existing online learning algorithms to achieve sub-linear regret and number of fairness violations. Surprisingly, in the stochastic setting where the data are drawn independently from a distribution, we are also able to establish PAC-style fairness and accuracy generalization guarantees (Yona and Rothblum [2018]), despite only having access to a very restricted form of fairness feedback. Our fairness generalization bound qualitatively matches the uniform convergence bound of Yona and Rothblum [2018], while also providing a meaningful accuracy generalization guarantee. Our results resolve an open question by Gillen et al. [2018] by showing that online learning under an unknown individual fairness constraint is possible even without assuming a strong parametric form of the underlying similarity measure.
    Combining Interventional and Observational Data Using Causal Reductions. (arXiv:2103.04786v2 [stat.ML] UPDATED)
    (2 min) Unobserved confounding is one of the main challenges when estimating causal effects. We propose a causal reduction method that, given a causal model, replaces an arbitrary number of possibly high-dimensional latent confounders with a single latent confounder that takes values in the same space as the treatment variable, without changing the observational and interventional distributions the causal model entails. This allows us to estimate the causal effect in a principled way from combined data without relying on the common but often unrealistic assumption that all confounders have been observed. We apply our causal reduction in three different settings. In the first setting, we assume the treatment and outcome to be discrete. The causal reduction then implies bounds between the observational and interventional distributions that can be exploited for estimation purposes. In certain cases with highly unbalanced observational samples, the accuracy of the causal effect estimate can be improved by incorporating observational data. Second, for continuous variables and assuming a linear-Gaussian model, we derive equality constraints for the parameters of the observational and interventional distributions. Third, for the general continuous setting (possibly nonlinear or non-Gaussian), we parameterize the reduced causal model using normalizing flows, a flexible class of easily invertible nonlinear transformations. We perform a series of experiments on synthetic data and find that in several cases the number of interventional samples can be reduced when adding observational training samples without sacrificing accuracy.
    Fractal Gaussian Networks: A sparse random graph model based on Gaussian Multiplicative Chaos. (arXiv:2008.03038v2 [stat.ML] UPDATED)
    (2 min) We propose a novel stochastic network model, called Fractal Gaussian Network (FGN), that embodies well-defined and analytically tractable fractal structures. Such fractal structures have been empirically observed in diverse applications. FGNs interpolate continuously between the popular purely random geometric graphs (a.k.a. the Poisson Boolean network), and random graphs with increasingly fractal behavior. In fact, they form a parametric family of sparse random geometric graphs that are parametrized by a fractality parameter which governs the strength of the fractal structure. FGNs are driven by the latent spatial geometry of Gaussian Multiplicative Chaos (GMC), a canonical model of fractality in its own right. We asymptotically characterize the expected number of edges, triangles, cliques and hub-and-spoke motifs in FGNs, unveiling a distinct pattern in their scaling with the size parameter of the network. We then examine the natural question of detecting the presence of fractality and the problem of parameter estimation based on observed network data, in addition to fundamental properties of the FGN as a random graph model. We also explore fractality in community structures by unveiling a natural stochastic block model in the setting of FGNs. Finally, we substantiate our results with phenomenological analysis of the FGN in the context of available scientific literature for fractality in networks, including applications to real-world massive network data.
  • John D. Cook

    Multiserver queues with general probability distributions
    (2 min) Textbook presentations of queueing theory start by assuming that the time between customer arrivals and the time to serve a customer are both exponentially distributed. Maybe these assumptions are realistic enough, but maybe not. For example, maybe there’s a cutoff on service time. If someone takes too long, you tell them there’s no more you […]
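    Once the exponential assumptions are dropped, closed forms become scarce but simulation stays easy. A single-server sketch (an editor's illustration; the multiserver case needs per-server bookkeeping) using the Lindley recursion W_{k+1} = max(0, W_k + S_k - A_{k+1}), with a cutoff on service times as in the example above:

        import numpy as np

        rng = np.random.default_rng(0)
        n = 200_000
        interarrival = rng.exponential(1.0, n)                   # Poisson arrivals
        service = np.minimum(rng.lognormal(-1.0, 1.0, n), 3.0)   # general service, capped at 3

        # waiting time of each customer in a single-server FIFO queue
        wait = np.zeros(n)
        for k in range(n - 1):
            wait[k + 1] = max(0.0, wait[k] + service[k] - interarrival[k + 1])
        print("mean wait:", wait.mean())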
    Queueing and Economies of Scale
    (2 min) If a single server is being pushed to the limit, adding a second server can drastically reduce wait times. This is true whether the server is a human serving food or a computer serving web pages. I first wrote about this here and I give more details here. What if you add an extra server […]
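    The effect can be checked numerically with the Erlang C formula for the expected queueing wait in an M/M/c system. A short sketch comparing one heavily loaded server with two servers at the same per-server utilisation:

        from math import factorial

        def erlang_c_wait(lmbda, mu, c):
            """Expected wait in queue for an M/M/c system (stable when lmbda < c*mu)."""
            a = lmbda / mu                  # offered load in Erlangs
            rho = a / c                     # per-server utilisation
            tail = a**c / (factorial(c) * (1 - rho))
            p_wait = tail / (sum(a**k / factorial(k) for k in range(c)) + tail)
            return p_wait / (c * mu - lmbda)

        print(erlang_c_wait(0.9, 1.0, 1))   # one server at 90% utilisation: ~9
        print(erlang_c_wait(1.8, 1.0, 2))   # two servers, same per-server load: ~4.3

    Doubling the servers at the same per-server load roughly halves the expected wait here (about 9 vs. 4.3 time units), which is the economy of scale in action.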
    Linear logic arithmetic
    (1 min) Linear logic has connectives not used in classical logic. The connectives ⊗ and & are conjunctions, ⊕ and ⅋ are disjunctions, and ! and ? are analogous to the modal operators ◻ and ◇ (necessity and possibility). Another way to classify the connectives is to say ⊕ and & are called additive, ⊗ and ⅋ are […]
  • Artificial Intelligence

    Trained GPT-2 to write poems
    (2 min) BEHOLD! I'm presenting you... A terrible mix of Philip Sidney, William Blake and Emily Dickinson. Enjoy reading those terrible poems of GPT-2 o_o THE ECHOING GREEN My mistress calls me When I am in the sky, And I am just, and bright. She calls me when I am in the sky, And she sings My joys, my sorrows, To my songs you sweet child can hear. Oh! how sweet is the sound Of God! As he smiles his child appears, And smiles on me, and out of me shine. I previously shared the next one, but it feels nice to read. That's one of the best I guess. A MAN'S DIALOGUE The man's life revolves round a circle; There, in the middle, the circle of life Reveals itself, all its life, Except and within itself, The whole circle is alive. The AI... Oh Boy... What have you done GPT-2? Sweet is his wife, if sh…
    I built an AI bot that draws people’s dream jobs on Twitter
    (1 min) submitted by /u/maaartiin_mac
    A Science Journalist’s Journey to Understand AI
    (0 min) submitted by /u/regalalgorithm
    Meta AI Introduces AV-HuBERT: A State-Of-The-Art Self-Supervised Framework For Understanding Speech That Learns By Both Seeing And Hearing People Speak
    (1 min) AI is used for various speech recognition and understanding activities, ranging from enabling smart speakers to designing aids for persons who are deaf or have speech impairments. However, these speech comprehension algorithms frequently fail to perform well in everyday scenarios where we need them the most: when numerous people speak simultaneously or when there is a lot of background noise. Even advanced noise-canceling techniques are frequently ineffective against, for instance, the sound of the ocean on a beach trip or the background chatter of a noisy street market. Humans interpret speech better than AI in these situations because we use both our ears and our sight. For example, we could watch someone’s mouth move and intuitively know the voice we’re hearing is coming from her. That is why Meta AI is developing new conversational AI systems that, like us, can discern intricate relationships between what they see and what they hear in conversation. Paper 1: https://arxiv.org/abs/2201.01763 Paper 2: https://arxiv.org/abs/2201.02184 Github: https://facebookresearch.github.io/av_hubert/ submitted by /u/techsucker
    Remove Unwanted Objects From High-Quality Images! (not only 256x256...!). LaMa explained
    (1 min) submitted by /u/OnlyProggingForFun
    Need a book about artificial intelligence/neural network/soft computing written by an Indian author. If you have pdf please share!
    (1 min) Any book will suffice for me. But the books that I am looking for are these: https://www.amazon.com/Introduction-Neural-Networks-Matlab-6-0/dp/0070591121 https://b-ok.asia/book/11849928/95de3e There is one book on b-ok.org but some guy has just posted the Google Books preview PDF of this book. lol. If you studied artificial intelligence/soft computing/neural networks at university, can you share the book you used to pass university exam questions? I am from Nepal and our syllabus is copied from those Indian university colleges. So having their book would be helpful. But the thing is that Indian authors' textbooks are not available in Nepal lol. submitted by /u/barnyard9
    AI based diet and nutrition planner?
    (0 min) Any app or website that has a mature AI engine to recommend a diet and nutrition plan? Or an open-source engine that does it when trained? submitted by /u/purelibran
  • Machine Learning

    [P] A tutorial on evolving network parameters with JAX
    (1 min) Hi there. I've been interested in evolutionary strategies for a while, and OpenAI's paper on them shows that they can be a pretty interesting alternative to RL. Now that I've worked with JAX a bit, I decided to put together a tutorial showcasing how I used its vectorization and parallelization to implement evolution pretty easily and get good performance on GPU/TPU devices. https://github.com/ttt733/nn-evolution-jax/blob/main/tutorial.md This isn't a tutorial on achieving results on an interesting task, but I just wanted to write about an optimization technique I think is somewhat overlooked. The focus is on walking through how I used JAX's functions to implement it. I don't go into the math behind evolutionary strategies because there are already lots of articles on particular strategies (like CMA-ES) and their performance, how they work, etc. I just use the simplistic strategies OpenAI called out in their post and used in their MuJoCo-solving code - you're free to experiment with more complex ones. And, if you know JAX better than I do, please feel free to correct me on anything I got wrong. (I know some of the vectorization stuff can get weird, so there's a chance that I vmap'd a little too aggressively.) Hope you enjoy! submitted by /u/kulili
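    For readers who want the flavour without opening the repo, here is a generic OpenAI-style evolution strategies step written with JAX's vmap (an editor's sketch on a toy objective, not the tutorial's code; all names below are made up):

        import jax
        import jax.numpy as jnp

        def fitness(params):
            """Toy objective: maximized when every parameter equals 3."""
            return -jnp.sum((params - 3.0) ** 2)

        def es_step(params, key, pop=50, sigma=0.1, lr=0.05):
            # sample a population of Gaussian perturbations and score them in parallel
            eps = jax.random.normal(key, (pop,) + params.shape)
            scores = jax.vmap(lambda e: fitness(params + sigma * e))(eps)
            scores = (scores - scores.mean()) / (scores.std() + 1e-8)  # rank-ish normalization
            grad = (eps * scores[:, None]).mean(axis=0) / sigma        # ES gradient estimate
            return params + lr * grad

        params = jnp.zeros(5)
        key = jax.random.PRNGKey(0)
        for _ in range(200):
            key, sub = jax.random.split(key)
            params = es_step(params, sub)
        print(params)  # should approach [3, 3, 3, 3, 3]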
    [P] I made an AI twitter bot that draws people’s dream jobs for them.
    (2 min) submitted by /u/maaartiin_mac
    [R] MatrixCalculus.org -- A tool for automatically computing Matrix Derivatives
    (1 min) I came across this site via Andrew Gelman's blog (https://statmodeling.stat.columbia.edu/2020/06/03/online-matrix-derivative-calculator-vs-matrix-normal-stan/): http://www.matrixcalculus.org/ It will do matrix calculus for you, and not only does it provide the result, it can export it to LaTeX or produce Python code! It's super useful for when you have to precompute things by "hand." There's an associated NeurIPS paper if you want the gory details: http://www.matrixcalculus.org/matrixcalculus.pdf submitted by /u/bikeskata
    [D] Are there any effective machine learning methods that aren't copied from nature?
    (2 min) Recently I have been struck by the fact that the two most powerful machine learning methods, neural networks and genetic algorithms, are partly just copied from nature (in concept at least; obviously there was a ton of work by a lot of brilliant people). I guess there is a lot of machine learning that is basically just memorization or math, but are there any other deep learning algorithm paradigms out there? Do you think there could be someday? submitted by /u/naxpouse
    [D] Is there an open-source implementation of the Retrieval-Enhanced Transformer (RETRO)?
    (1 min) Wondering if there's an open-source implementation of the Retrieval-Enhanced Transformer (RETRO) [arxiv link]? I imagine that it may take some time to make its way to production libraries like Huggingface Transformers due to the external memory component. Anyone heard of people working on this? submitted by /u/bigvenn
    [R] Detecting Twenty-thousand Classes using Image-level Supervision
    (0 min) submitted by /u/Illustrious_Row_9971
    [P] Built a dog poop detector for my backyard
    (2 min) Over winter break I started poking around online for ways to track dog poop in my backyard. I don't like having to walk around and hope I picked up all of it. Where I live it snows a lot, and poops get lost in the snow come new snowfall. I found some cool concept gadgets that people have made, but nothing that worked with just a security cam. So I built this poop detector and made a video about it. When some code I wrote detects my dog pooping it will remember the location and draw a circle where my dog pooped on a picture of my backyard. So over the course of a couple of months I have a bunch of circles on a picture of my backyard, where all my dog's poops are. So this coming spring I will know where to look! Check out the video if you care: https://www.youtube.com/watch?v=uWZu3rnj-kQ Figured I would share here, it was fun to work on. Is this something you would hook up to a security camera if it was simple? Curious. Also, check out DeepLabCut. My project wouldn't have been possible without it, and it's really cool: https://github.com/DeepLabCut/DeepLabCut submitted by /u/GoochCommander
    [R] Deep Symbolic Regression for Recurrent Sequences
    (0 min) submitted by /u/hardmaru
    [P] Simple Tensorflow Implementation of "Toward Spatially Unbiased Generative Models (ICCV 2021)"
    (1 min) Abstract: Recent image generation models show remarkable generation performance. However, they mirror strong location preference in datasets, which we call spatial bias. Therefore, generators render poor samples at unseen locations and scales. We argue that the generators rely on their implicit positional encoding to render spatial content. From our observations, the generator’s implicit positional encoding is translation-variant, making the generator spatially biased. To address this issue, we propose injecting explicit positional encoding at each scale of the generator. By learning the spatially unbiased generator, we facilitate the robust use of generators in multiple tasks, such as GAN inversion, multi-scale generation, generation of arbitrary sizes and aspect ratios. Furthermore, we show that our method can also be applied to denoising diffusion probabilistic models. Code: https://github.com/taki0112/Toward_spatial_unbiased-Tensorflow Paper: https://arxiv.org/abs/2108.01285 submitted by /u/taki0112
  • Reinforcement Learning

    Seeking advice on my custom env (demo included)
    (1 min) Hi! I'm looking for some advice. I created a custom gym environment and ran it with Stable Baselines' PPO. My environment is rather simple, I believe. It's a 2D space with two dots in it. One dot is the agent (blue dot). Another dot (red) is freely travelling vertically or horizontally (randomly selected at env init). The goal is to catch the red dot. The episode is considered done when the distance between both dots is under a small value. The observation space is continuous (x / y between 0 and 1). The action space is discrete (up down left right). demo The agent is rather good at "chasing" but not good at "planning". For example it could position itself on the inbound trajectory of the red dot to "intercept" it, but it fails to do so. Any idea how to get to that behavior? Code: https://github.com/Ouassimf/rl-tracker-2d Thanks for the advice! submitted by /u/Ouassimf
    Expected returns don't match with actions, because the agent outputs a 4x1 vector action space at each time step
    (1 min) I am working with an environment with a continuous action space. The agent needs to output 4 continuous values for the action space. I'm using REINFORCE, so I built a small net that, given a (31,)-sized observation space, outputs a 4x1 vector (https://www.pettingzoo.ml/sisl/multiwalker). My loss is: loss = -torch.sum(torch.log(prob_batch) * expected_returns_batch) So it requires an array of action probabilities for the actions that were taken and the discounted rewards. For this reason, I recompute the action probabilities for all the states in the trajectory and subset the action probabilities associated with the actions that were actually taken with the following two lines of code: pred_batch = model(state_batch) prob_batch = pred_batch.gather(dim=1, index=action_batch.long().view(-1,1)).squeeze() However, I receive this error: RuntimeError: Size does not match at dimension 0 expected index [1116, 1] to be smaller than self [279, 4] apart from dimension 1 So the problem seems to be that the agent emits four actions at each time step, but the code expects one action only (1116 is indeed four times 279). How do I fix this? Thanks! submitted by /u/No_Possibility_7588
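    An editorial note that may help here: gather only makes sense for discrete actions, where action_batch indexes one of the network's outputs. With a 4-dimensional continuous action, the usual fix is to have the network parameterize a distribution, e.g. a diagonal Gaussian, and sum log-probabilities over the action dimensions. A hedged sketch reusing the post's sizes (31-dimensional observations, 4-dimensional actions, 279 steps; the class and variable names are illustrative):

        import torch
        from torch import nn
        from torch.distributions import Normal

        class GaussianPolicy(nn.Module):
            def __init__(self, obs_dim=31, act_dim=4):
                super().__init__()
                self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
                self.mu = nn.Linear(64, act_dim)
                self.log_std = nn.Parameter(torch.zeros(act_dim))

            def forward(self, obs):
                h = self.body(obs)
                return Normal(self.mu(h), self.log_std.exp())  # diagonal Gaussian

        policy = GaussianPolicy()
        obs = torch.randn(279, 31)              # a batch of states from one trajectory
        dist = policy(obs)
        actions = dist.sample()                 # (279, 4) continuous actions
        log_probs = dist.log_prob(actions).sum(dim=-1)  # one scalar per time step
        returns = torch.randn(279)              # placeholder discounted returns
        loss = -(log_probs * returns).sum()     # REINFORCE loss
        loss.backward()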
    Outputting a 4x1 continuous action space according to a probability distribution
    (1 min) Hi all, I am working with an environment with a continuous action space. The agent needs to output 4 continuous values for the action space. I'm using REINFORCE, so I built a small net that, given a (31,)-sized observation space, outputs a 4x1 vector (https://www.pettingzoo.ml/sisl/multiwalker). The reason I'm getting confused is that, if the possible actions were 4 and the agent had to select 1, I could use the probability distribution to choose the one with the highest value. However, here the agent needs to output a 4x1 action space, so I'm not sure how to use the probability distribution. Thanks! submitted by /u/No_Possibility_7588
    Action space for multi units controlled by a single agent
    (1 min) Hi everyone, I have a gym environment where there are multiple units controlled by a single agent. These units can also create new units, and the units may also die. Since the number of units may vary, I am wondering how to design an action space if my agent has to take actions for each unit in a single step. submitted by /u/Coder_Unknown
    Can Rainbow learn a stochastic policy?
    (1 min) Hi, so I'm developing an RL algorithm, and I have a problem choosing between Rainbow and PPO. According to Sutton and Barto's book, DQN fails to learn a stochastic policy, i.e. one where, in state or observation s, acting optimally means taking action a1 with 60% probability and action b with 40% probability. Since DQN always outputs the optimal action (based on its value) and not a distribution over action probabilities, it seems DQN can't learn an optimal stochastic policy. I'm unfamiliar with the Rainbow implementation, but does it solve this problem of DQN learning only deterministic policies? submitted by /u/basic_r_user
    PPO BipedalWalker-v3 not converging enough
    (2 min) I am training a BipedalWalker-v3 agent with the PPO algorithm and I can't get my agent to reach a reward of 300 or higher. I implemented basic PPO with several upgrades: GAE, normalized advantages, value-function loss clipping, an entropy term in the overall loss, Adam learning-rate annealing, mini-batch updates, the epsilon parameter of the Adam optimizer set to 1e-5, observation normalization and clipping, reward scaling and clipping, and one NN for the policy and one for the critic. Results are shown in a plot where the Y axis is reward and the X axis is one update step (2048 steps, 10 updates) [reward curve image]. I get really close, but never over 300. Let alone an average reward of 300 over 100 episodes, which is the goal. I tried to tune my hyperparameters but wasn't successful. I can provide other graphs, like policy loss, critic loss or mean-100 reward if that helps, but I need help reading these graphs so I can change my hyperparameters or maybe my implementation. The best I managed was a reward of 300.1 in one step (and never again) and a mean reward of 282 when the test process ran the NNs for 100 episodes. submitted by /u/TheGuy839
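    Since GAE and advantage normalization are on the list above, a generic reference implementation of that computation (an editor's sketch, not the poster's code) may help with debugging:

        import numpy as np

        def gae(rewards, values, dones, gamma=0.99, lam=0.95):
            """Generalized Advantage Estimation.

            values holds one extra bootstrap entry: len(values) == len(rewards) + 1.
            dones[t] is 1.0 if the episode ended at step t, else 0.0.
            """
            adv = np.zeros(len(rewards))
            last = 0.0
            for t in reversed(range(len(rewards))):
                mask = 1.0 - dones[t]
                delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
                last = delta + gamma * lam * mask * last
                adv[t] = last
            returns = adv + values[:-1]                      # targets for the critic
            adv = (adv - adv.mean()) / (adv.std() + 1e-8)    # normalized advantages
            return adv, returns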
    Any tips for training policy gradient on solving mazes?
    (1 min) submitted by /u/Mithrandir2k16
    Question about Uber's POET (specifically, which version of BipedalWalker they used)
    (1 min) I'm wondering if their code used V2 or V3. I will ask them if no one here knows, but I thought it would be faster to ask here. Apparently V2 had a pretty terrible bug with the LIDARs, so I don't want to reuse POET's code before making sure they used the right version. Thanks! submitted by /u/DuchessOfNull
    Definition of an Episodic Task
    (1 min) Hi, Just wondering if anyone has come across a formal definition of episodic tasks in reinforcement learning? Sutton's (excellent) book talks about episodic tasks but seems not to formally define them. The questions I'm trying to answer are: (1) does episodic imply episodes are time-limited to some (known) length H (one author uses this)? (2) Does it just imply termination in a finite (but arbitrarily large) number of steps? (3) Does it imply neither of these? My thinking is that ergodic + at least one terminal state + a finite number of states implies (2) (but an infinite number of states does not). Thanks submitted by /u/VirtualHat

2022-01-14

  • cs.LG updates on arXiv.org

    Eigenvalue Distribution of Large Random Matrices Arising in Deep Neural Networks: Orthogonal Case. (arXiv:2201.04543v1 [stat.ML])
    (2 min) The paper deals with the distribution of singular values of the input-output Jacobian of deep untrained neural networks in the limit of their infinite width. The Jacobian is the product of random matrices where the independent rectangular weight matrices alternate with diagonal matrices whose entries depend on the corresponding column of the nearest neighbor weight matrix. The problem was considered in \cite{Pe-Co:18} for Gaussian weights and biases, and also for weights that are Haar-distributed orthogonal matrices and Gaussian biases. Based on a free probability argument, it was claimed that in these cases the singular value distribution of the Jacobian in the limit of infinite width (matrix size) coincides with that of an analog of the Jacobian with special random but weight-independent diagonal matrices, a case well known in random matrix theory. The claim was rigorously proved in \cite{Pa-Sl:21} for a quite general class of weights and biases with i.i.d. (including Gaussian) entries by using a version of the techniques of random matrix theory. In this paper we use another version of these techniques to justify the claim for random Haar-distributed weight matrices and Gaussian biases.
    Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data. (arXiv:2112.07891v3 [cs.SD] UPDATED)
    (2 min) Deep learning techniques for separating audio into different sound sources face several challenges. Standard architectures require training separate models for different types of audio sources. Although some universal separators employ a single model to target multiple sources, they have difficulty generalizing to unseen sources. In this paper, we propose a three-component pipeline to train a universal audio source separator from a large, but weakly-labeled dataset: AudioSet. First, we propose a transformer-based sound event detection system for processing weakly-labeled training data. Second, we devise a query-based audio separation model that leverages this data for model training. Third, we design a latent embedding processor to encode queries that specify audio targets for separation, allowing for zero-shot generalization. Our approach uses a single model for source separation of multiple sound types, and relies solely on weakly-labeled data for training. In addition, the proposed audio separator can be used in a zero-shot setting, learning to separate types of audio sources that were never seen in training. To evaluate the separation performance, we test our model on MUSDB18, while training on the disjoint AudioSet. We further verify the zero-shot performance by conducting another experiment on audio source types that are held-out from training. The model achieves comparable Source-to-Distortion Ratio (SDR) performance to current supervised models in both cases.
    One-Step Abductive Multi-Target Learning with Diverse Noisy Samples: An Application to Tumour Segmentation for Breast Cancer. (arXiv:2110.10325v2 [cs.LG] UPDATED)
    (0 min) One-step abductive multi-target learning (OSAMTL) is an approach proposed to handle complex noisy labels. However, OSAMTL is not suitable for situations where diverse noisy samples (DNS) are provided for a learning task. In this paper, after giving a definition of DNS, we propose one-step abductive multi-target learning with DNS (OSAMTL-DNS) to expand the original OSAMTL to a wider range of tasks that handle complex noisy labels. Applying OSAMTL-DNS to tumour segmentation for breast cancer in medical histopathology whole-slide image analysis, we show that OSAMTL-DNS is able to enable various state-of-the-art approaches for learning from noisy labels to achieve significantly more rational predictions.
    Smoothness and continuity of cost functionals for ECG mismatch computation. (arXiv:2201.04487v1 [physics.med-ph])
    (0 min) The field of cardiac electrophysiology tries to abstract, describe and finally model the electrical characteristics of a heartbeat. With recent advances in cardiac electrophysiology, models have become more powerful and descriptive than ever. However, to advance the field of inverse electrophysiological modeling, i.e. creating models from electrical measurements such as the ECG, the less investigated question of the smoothness of the simulated ECGs w.r.t. model parameters needs to be further explored. The present paper discusses smoothness in terms of the whole pipeline that describes how we arrive at the simulated ECG from physiological parameters. Employing such a pipeline, we create a test bench of a simplified idealized left-ventricle model and demonstrate the most important factors for efficient inverse modeling through smooth cost functionals. Such knowledge will be important for designing and creating inverse models in future optimization and machine learning methods.
    Manifold learning via quantum dynamics. (arXiv:2112.11161v2 [quant-ph] UPDATED)
    (0 min) We introduce an algorithm for computing geodesics on sampled manifolds that relies on simulation of quantum dynamics on a graph embedding of the sampled data. Our approach exploits classic results in semiclassical analysis and the quantum-classical correspondence, and forms a basis for techniques to learn the manifold from which a dataset is sampled, and subsequently for nonlinear dimensionality reduction of high-dimensional datasets. We illustrate the new algorithm with data sampled from model manifolds and also by a clustering demonstration based on COVID-19 mobility data. Finally, our method reveals interesting connections between the discretization provided by data sampling and quantization.
    A Feature Extraction based Model for Hate Speech Identification. (arXiv:2201.04227v1 [cs.CL])
    (2 min) The detection of hate speech online has become an important task, as offensive language such as hurtful, obscene and insulting content can harm marginalized people or groups. This paper presents the TU Berlin team's experiments and results on tasks 1A and 1B of the 2021 shared task on hate speech and offensive content identification in Indo-European languages. The success of different Natural Language Processing models is evaluated for the respective subtasks throughout the competition. We tested different models based on recurrent neural networks at the word and character levels, as well as transfer learning approaches based on BERT, on the dataset provided by the competition. Among the tested models, the transfer learning-based models achieved the best results in both subtasks.
    Blackbox Post-Processing for Multiclass Fairness. (arXiv:2201.04461v1 [cs.LG])
    (2 min) Applying standard machine learning approaches for classification can produce unequal results across different demographic groups. When such models are then used in real-world settings, these inequities can have negative societal impacts. This has motivated the development of various approaches to fair classification with machine learning models in recent years. In this paper, we consider the problem of modifying the predictions of a blackbox machine learning classifier in order to achieve fairness in a multiclass setting. To accomplish this, we extend the 'post-processing' approach of Hardt et al. 2016, which focuses on fairness for binary classification, to the setting of fair multiclass classification. We explore when our approach produces both fair and accurate predictions through systematic synthetic experiments and also evaluate discrimination-fairness tradeoffs on several publicly available real-world application datasets. We find that, overall, our approach produces minor drops in accuracy and enforces fairness when the number of individuals in the dataset is high relative to the number of classes and protected groups.
    Flexible, Non-parametric Modeling Using Regularized Neural Networks. (arXiv:2012.11369v3 [cs.LG] UPDATED)
    (2 min) Non-parametric, additive models are able to capture complex data dependencies in a flexible, yet interpretable way. However, choosing the format of the additive components often requires non-trivial data exploration. Here, as an alternative, we propose PrAda-net, a one-hidden-layer neural network, trained with proximal gradient descent and adaptive lasso. PrAda-net automatically adjusts the size and architecture of the neural network to reflect the complexity and structure of the data. The compact network obtained by PrAda-net can be translated to additive model components, making it suitable for non-parametric statistical modelling with automatic model selection. We demonstrate PrAda-net on simulated data, where we compare the test error performance, variable importance and variable subset identification properties of PrAda-net to other lasso-based regularization approaches for neural networks. We also apply PrAda-net to the massive U.K. black smoke data set, to demonstrate how PrAda-net can be used to model complex and heterogeneous data with spatial and temporal components. In contrast to classical, statistical non-parametric approaches, PrAda-net requires no preliminary modeling to select the functional forms of the additive components, yet still results in an interpretable model representation.
    SLISEMAP: Explainable Dimensionality Reduction. (arXiv:2201.04455v1 [cs.LG])
    (0 min) Existing explanation methods for black-box supervised learning models generally work by building local models that explain the model's behaviour for a particular data item. It is possible to make global explanations, but the explanations may have low fidelity for complex models. Most of the prior work on explainable models has focused on classification problems, with less attention on regression. We propose a new manifold visualization method, SLISEMAP, that at the same time finds local explanations for all of the data items and builds a two-dimensional visualization of model space such that data items explained by the same model are projected nearby. We provide an open-source implementation of our methods, using the GPU-optimized PyTorch library. SLISEMAP works on both classification and regression models. We compare SLISEMAP to the most popular dimensionality reduction methods and some local explanation methods. We provide a mathematical derivation of our problem and show that SLISEMAP provides fast and stable visualizations that can be used to explain and understand black-box regression and classification models.
    On generalization bounds for deep networks based on loss surface implicit regularization. (arXiv:2201.04545v1 [stat.ML])
    (0 min) Classical statistical learning theory says that fitting too many parameters leads to overfitting and poor performance. That modern deep neural networks generalize well despite a large number of parameters contradicts this finding and constitutes a major unsolved problem towards explaining the success of deep learning. The implicit regularization induced by stochastic gradient descent (SGD) has been regarded as important, but its specific principle is still unknown. In this work, we study how the local geometry of the energy landscape around local minima affects the statistical properties of SGD with Gaussian gradient noise. We argue that under reasonable assumptions, the local geometry forces SGD to stay close to a low-dimensional subspace and that this induces implicit regularization and results in tighter bounds on the generalization error for deep neural networks. To derive generalization error bounds for neural networks, we first introduce a notion of stagnation sets around the local minima and impose a local essential convexity property of the population risk. Under these conditions, lower bounds for SGD to remain in these stagnation sets are derived. If stagnation occurs, we derive a bound on the generalization error of deep neural networks involving the spectral norms of the weight matrices but not the number of network parameters. Technically, our proofs are based on controlling the change of parameter values in the SGD iterates and local uniform convergence of the empirical loss functions based on the entropy of suitable neighborhoods around local minima. Our work attempts to better connect non-convex optimization and generalization analysis with uniform convergence.
    Test for non-negligible adverse shifts. (arXiv:2107.02990v3 [stat.ML] UPDATED)
    (2 min) Statistical tests for dataset shift are susceptible to false alarms: they are sensitive to minor differences when there is in fact adequate sample coverage and predictive performance. We propose instead a framework to detect adverse dataset shifts based on outlier scores, $\texttt{D-SOS}$ for short. $\texttt{D-SOS}$ holds that the new (test) sample is not substantively worse than the reference (training) sample, and not that the two are equal. The key idea is to reduce observations to outlier scores and compare contamination rates at varying weighted thresholds. Users can define what $\it{worse}$ means in terms of relevant notions of outlyingness, including proxies for predictive performance. Compared to tests of equal distribution, our approach is uniquely tailored to serve as a robust metric for model monitoring and data validation. We show how versatile and practical $\texttt{D-SOS}$ is on a wide range of real and simulated data.
    Extensions to the Proximal Distance Method of Constrained Optimization. (arXiv:2009.00801v2 [math.OC] UPDATED)
    (2 min) The current paper studies the problem of minimizing a loss $f(\boldsymbol{x})$ subject to constraints of the form $\boldsymbol{D}\boldsymbol{x} \in S$, where $S$ is a closed set, convex or not, and $\boldsymbol{D}$ is a matrix that fuses parameters. Fusion constraints can capture smoothness, sparsity, or more general constraint patterns. To tackle this generic class of problems, we combine the Beltrami-Courant penalty method with the proximal distance principle. The latter is driven by minimization of penalized objectives $f(\boldsymbol{x})+\frac{\rho}{2}\text{dist}(\boldsymbol{D}\boldsymbol{x},S)^2$ involving large tuning constants $\rho$ and the squared Euclidean distance of $\boldsymbol{D}\boldsymbol{x}$ from $S$. The next iterate $\boldsymbol{x}_{n+1}$ of the corresponding proximal distance algorithm is constructed from the current iterate $\boldsymbol{x}_n$ by minimizing the majorizing surrogate function $f(\boldsymbol{x})+\frac{\rho}{2}\|\boldsymbol{D}\boldsymbol{x}-\mathcal{P}_{S}(\boldsymbol{D}\boldsymbol{x}_n)\|^2$. For fixed $\rho$, a subanalytic loss $f(\boldsymbol{x})$, and a subanalytic constraint set $S$, we prove convergence to a stationary point. Under stronger assumptions, we provide convergence rates and demonstrate linear local convergence. We also construct a steepest descent (SD) variant to avoid costly linear system solves. To benchmark our algorithms, we compare against the alternating direction method of multipliers (ADMM). Our extensive numerical tests include problems on metric projection, convex regression, convex clustering, total variation image denoising, and projection of a matrix to a good condition number. These experiments demonstrate the superior speed and acceptable accuracy of our steepest descent variant on high-dimensional problems.
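    The surrogate step has a closed form in simple cases. A toy sketch (an editor's illustration, not the paper's software) with f(x) = ||x - y||^2, D = I, and S the set of k-sparse vectors, where the projection keeps the k largest-magnitude entries: setting the surrogate's gradient to zero gives x = (2y + rho * p) / (2 + rho) with p = P_S(x_n).

        import numpy as np

        def project_sparse(v, k):
            """Euclidean projection onto the set of k-sparse vectors."""
            out = np.zeros_like(v)
            idx = np.argsort(np.abs(v))[-k:]
            out[idx] = v[idx]
            return out

        rng = np.random.default_rng(0)
        y, k, rho = rng.normal(size=20), 3, 1.0
        x = y.copy()
        for _ in range(100):
            p = project_sparse(x, k)
            x = (2 * y + rho * p) / (2 + rho)  # minimizer of the majorizing surrogate
            rho *= 1.1                         # slowly increase the penalty constant
        print(np.round(x, 3))                  # approximately k-sparse, close to y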
    Detecting Ransomware Execution in a Timely Manner. (arXiv:2201.04424v1 [cs.CR])
    (2 min) Ransomware has been an ongoing issue since the early 1990s. In recent times ransomware has spread from traditional computational resources to cyber-physical systems and industrial controls. We devised a series of experiments in which virtual instances are infected with ransomware. We instrumented the instances and collected resource utilization data across a variety of metrics (CPU, Memory, Disk Utility). We design a change point detection and learning method for identifying ransomware execution. Finally we evaluate and demonstrate its ability to detect ransomware efficiently in a timely manner when trained on a minimal set of samples. Our results represent a step forward for defense, and we conclude with further remarks for the path forward.
    Predicting Alzheimer's Disease Using 3DMgNet. (arXiv:2201.04370v1 [eess.IV])
    (2 min) Alzheimer's disease (AD) is an irreversible neurodegenerative disease of the brain. The disease may cause memory loss, difficulty communicating and disorientation. For the diagnosis of Alzheimer's disease, a series of scales is often needed for clinical evaluation, which not only increases the workload of doctors but also makes the diagnosis highly subjective. Therefore, for Alzheimer's disease, finding early diagnostic markers through imaging has become a top priority. In this paper, we propose a novel 3DMgNet architecture, a unified framework of multigrid methods and convolutional neural networks, to diagnose Alzheimer's disease (AD). The model is trained using an open dataset (the ADNI dataset) and then tested on a smaller dataset of ours. Finally, the model achieved 92.133% accuracy for AD vs. NC classification and significantly reduced the model parameters.
    Online Continual Learning under Extreme Memory Constraints. (arXiv:2008.01510v3 [cs.CV] UPDATED)
    (2 min) Continual Learning (CL) aims to develop agents emulating the human ability to sequentially learn new tasks while being able to retain knowledge obtained from past experiences. In this paper, we introduce the novel problem of Memory-Constrained Online Continual Learning (MC-OCL) which imposes strict constraints on the memory overhead that a possible algorithm can use to avoid catastrophic forgetting. As most, if not all, previous CL methods violate these constraints, we propose an algorithmic solution to MC-OCL: Batch-level Distillation (BLD), a regularization-based CL approach, which effectively balances stability and plasticity in order to learn from data streams, while preserving the ability to solve old tasks through distillation. Our extensive experimental evaluation, conducted on three publicly available benchmarks, empirically demonstrates that our approach successfully addresses the MC-OCL problem and achieves comparable accuracy to prior distillation methods requiring higher memory overhead.
    Careful! Training Relevance is Real. (arXiv:2201.04429v1 [cs.LG])
    (2 min) There is a recent proliferation of research on the integration of machine learning and optimization. One expansive area within this research stream is predictive-model embedded optimization, which uses pre-trained predictive models for the objective function of an optimization problem, so that features of the predictive models become decision variables in the optimization problem. Despite a recent surge in publications in this area, one aspect of this decision-making pipeline that has been largely overlooked is training relevance, i.e., ensuring that solutions to the optimization problem should be similar to the data used to train the predictive models. In this paper, we propose constraints designed to enforce training relevance, and show through a collection of experimental results that adding the suggested constraints significantly improves the quality of solutions obtained.
    HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning. (arXiv:2201.04182v1 [cs.LG])
    (2 min) In this work we propose a HyperTransformer, a transformer-based model for few-shot learning that generates weights of a convolutional neural network (CNN) directly from support samples. Since the dependence of a small generated CNN model on a specific task is encoded by a high-capacity transformer model, we effectively decouple the complexity of the large task space from the complexity of individual tasks. Our method is particularly effective for small target CNN architectures where learning a fixed universal task-independent embedding is not optimal and better performance is attained when the information about the task can modulate all model parameters. For larger models we discover that generating the last layer alone allows us to produce competitive or better results than those obtained with state-of-the-art methods while being end-to-end differentiable. Finally, we extend our approach to a semi-supervised regime utilizing unlabeled samples in the support set and further improving few-shot performance.
    Real-Time Style Modelling of Human Locomotion via Feature-Wise Transformations and Local Motion Phases. (arXiv:2201.04439v1 [cs.GR])
    (2 min) Controlling the manner in which a character moves in a real-time animation system is a challenging task with useful applications. Existing style transfer systems require access to a reference content motion clip; however, in real-time systems the future motion content is unknown and liable to change with user input. In this work we present a style modelling system that uses an animation synthesis network to model motion content based on local motion phases. An additional style modulation network uses feature-wise transformations to modulate style in real-time. To evaluate our method, we create and release a new style modelling dataset, 100STYLE, containing over 4 million frames of stylised locomotion data in 100 different styles that present a number of challenges for existing systems. To model these styles, we extend the local phase calculation with a contact-free formulation. In comparison to other methods for real-time style modelling, we show our system is more robust and efficient in its style representation while improving motion quality.
    RGRecSys: A Toolkit for Robustness Evaluation of Recommender Systems. (arXiv:2201.04399v1 [cs.IR])
    (2 min) Robust machine learning is an increasingly important topic that focuses on developing models resilient to various forms of imperfect data. Due to the pervasiveness of recommender systems in online technologies, researchers have carried out several robustness studies focusing on data sparsity and profile injection attacks. Instead, we propose a more holistic view of robustness for recommender systems that encompasses multiple dimensions - robustness with respect to sub-populations, transformations, distributional disparity, attack, and data sparsity. While there are several libraries that allow users to compare different recommender system models, there is no software library for comprehensive robustness evaluation of recommender system models under different scenarios. As our main contribution, we present a robustness evaluation toolkit, Robustness Gym for RecSys (RGRecSys -- https://www.github.com/salesforce/RGRecSys), that allows us to quickly and uniformly evaluate the robustness of recommender system models.
    Dynamic Price of Parking Service based on Deep Learning. (arXiv:2201.04188v1 [cs.LG])
    (0 min) The improvement of air quality in urban areas is one of the main concerns of public government bodies. This concern emerges from the evidence linking air quality and public health. Major efforts from government bodies in this area include monitoring and forecasting systems, banning the most polluting motor vehicles, and traffic limitations during periods of low-quality air. In this work, a proposal for dynamic prices in regulated parking services is presented. The dynamic parking prices must discourage parking when episodes of low-quality air are predicted. For this purpose, diverse deep learning strategies are evaluated. They have in common the use of collective air-quality measurements for forecasting labels about air quality in the city. The proposal is evaluated using economic parameters and deep learning quality criteria in Madrid (Spain).
    The Turing Trap: The Promise & Peril of Human-Like Artificial Intelligence. (arXiv:2201.04200v1 [econ.GN])
    (0 min) In 1950, Alan Turing proposed an imitation game as the ultimate test of whether a machine was intelligent: could a machine imitate a human so well that its answers to questions were indistinguishable from a human's? Ever since, creating intelligence that matches human intelligence has implicitly or explicitly been the goal of thousands of researchers, engineers, and entrepreneurs. The benefits of human-like artificial intelligence (HLAI) include soaring productivity, increased leisure, and perhaps most profoundly, a better understanding of our own minds. But not all types of AI are human-like. In fact, many of the most powerful systems are very different from humans. So an excessive focus on developing and deploying HLAI can lead us into a trap. As machines become better substitutes for human labor, workers lose economic and political bargaining power and become increasingly dependent on those who control the technology. In contrast, when AI is focused on augmenting humans rather than mimicking them, then humans retain the power to insist on a share of the value created. Furthermore, augmentation creates new capabilities and new products and services, ultimately generating far more value than merely human-like AI. While both types of AI can be enormously beneficial, there are currently excess incentives for automation rather than augmentation among technologists, business executives, and policymakers.
    Adaptive Worker Grouping For Communication-Efficient and Straggler-Tolerant Distributed SGD. (arXiv:2201.04301v1 [cs.IT])
    (0 min) Wall-clock convergence time and communication load are key performance metrics for the distributed implementation of stochastic gradient descent (SGD) in parameter server settings. Communication-adaptive distributed Adam (CADA) has been recently proposed as a way to reduce communication load via the adaptive selection of workers. CADA is subject to performance degradation in terms of wall-clock convergence time in the presence of stragglers. This paper proposes a novel scheme named grouping-based CADA (G-CADA) that retains the advantages of CADA in reducing the communication load, while increasing the robustness to stragglers at the cost of additional storage at the workers. G-CADA partitions the workers into groups of workers that are assigned the same data shards. Groups are scheduled adaptively at each iteration, and the server only waits for the fastest worker in each selected group. We provide analysis and experimental results to elaborate the significant gains on the wall-clock time, as well as communication load and computation load, of G-CADA over other benchmark schemes.
    Evolutionary Action Selection for Gradient-based Policy Learning. (arXiv:2201.04286v1 [cs.NE])
    (2 min) Evolutionary Algorithms (EAs) and Deep Reinforcement Learning (DRL) have recently been combined to integrate the advantages of the two solutions for better policy learning. However, in existing hybrid methods, the EA is used to directly train the policy network, which leads to sample inefficiency and an unpredictable impact on policy performance. To better integrate these two approaches and avoid the drawbacks caused by the introduction of EA, we devise a more efficient and principled method of combining EA and DRL. In this paper, we propose Evolutionary Action Selection-Twin Delayed Deep Deterministic Policy Gradient (EAS-TD3), a novel combination of EA and DRL. In EAS, we focus on optimizing the action chosen by the policy network and attempt to obtain high-quality actions to guide policy learning through an evolutionary algorithm. We conduct several experiments on challenging continuous control tasks. The results show that EAS-TD3 outperforms other state-of-the-art methods.
    Two Wrongs Can Make a Right: A Transfer Learning Approach for Chemical Discovery with Chemical Accuracy. (arXiv:2201.04243v1 [physics.chem-ph])
    (2 min) Appropriately identifying and treating molecules and materials with significant multi-reference (MR) character is crucial for achieving high data fidelity in virtual high throughput screening (VHTS). Nevertheless, most VHTS is carried out with approximate density functional theory (DFT) using a single functional. Despite development of numerous MR diagnostics, the extent to which a single value of such a diagnostic indicates MR effect on chemical property prediction is not well established. We evaluate MR diagnostics of over 10,000 transition metal complexes (TMCs) and compare to those in organic molecules. We reveal that only some MR diagnostics are transferable across these materials spaces. By studying the influence of MR character on chemical properties (i.e., MR effect) that involves multiple potential energy surfaces (i.e., adiabatic spin splitting, $\Delta E_\mathrm{H-L}$, and ionization potential, IP), we observe that cancellation in MR effect outweighs accumulation. Differences in MR character are more important than the total degree of MR character in predicting MR effect in property prediction. Motivated by this observation, we build transfer learning models to directly predict CCSD(T)-level adiabatic $\Delta E_\mathrm{H-L}$ and IP from lower levels of theory. By combining these models with uncertainty quantification and multi-level modeling, we introduce a multi-pronged strategy that accelerates data acquisition by at least a factor of three while achieving chemical accuracy (i.e., 1 kcal/mol) for robust VHTS.
    Intra-domain and cross-domain transfer learning for time series data -- How transferable are the features?. (arXiv:2201.04449v1 [cs.LG])
    (2 min) In practice, it is very demanding and sometimes impossible to collect labeled datasets large enough to successfully train a machine learning model, and one possible solution to this problem is transfer learning. This study aims to assess how transferable features are between different domains of time series data and under which conditions. The effects of transfer learning are observed in terms of the predictive performance of the models and their convergence rate during training. In our experiments, we use reduced datasets of 1,500 and 9,000 data instances to mimic real-world conditions. Using the same scaled-down datasets, we trained two sets of machine learning models: those trained with transfer learning and those trained from scratch. Four machine learning models were used for the experiment. Transfer of knowledge was performed within the same domain of application (seismology), as well as between mutually different domains of application (seismology, speech, medicine, finance). In order to confirm the validity of the obtained results, we repeated the experiments seven times and applied statistical tests to confirm the significance of the results. The general conclusion of our study is that transfer learning is very likely to either increase or not negatively affect the predictive performance of the model or its convergence rate. The collected data is analysed in more detail to determine which source and target domains are compatible for the transfer of knowledge. We also analyse the effect of target dataset size and of the selection of the model and its hyperparameters on the effects of transfer learning.
    Neural Capacitance: A New Perspective of Neural Network Selection via Edge Dynamics. (arXiv:2201.04194v1 [cs.LG])
    (2 min) Efficient model selection for identifying a suitable pre-trained neural network for a downstream task is a fundamental yet challenging problem in deep learning. Current practice requires expensive computation in model training for performance prediction. In this paper, we propose a novel framework for neural network selection by analyzing the governing dynamics over synaptic connections (edges) during training. Our framework is built on the fact that back-propagation during neural network training is equivalent to the dynamical evolution of synaptic connections. Therefore, a converged neural network is associated with an equilibrium state of a networked system composed of those edges. To this end, we construct a network mapping $\phi$, converting a neural network $G_A$ to a directed line graph $G_B$ that is defined on those edges in $G_A$. Next, we derive a neural capacitance metric $\beta_{\rm eff}$ as a predictive measure universally capturing the generalization capability of $G_A$ on the downstream task using only a handful of early training results. We carried out extensive experiments using 17 popular pre-trained ImageNet models and five benchmark datasets, including CIFAR10, CIFAR100, SVHN, Fashion MNIST and Birds, to evaluate the fine-tuning performance of our framework. Our neural capacitance metric is shown to be a powerful indicator for model selection based only on early training results and is more efficient than state-of-the-art methods.
    On Training Implicit Models. (arXiv:2111.05177v4 [cs.LG] UPDATED)
    (2 min) This paper focuses on training implicit models of infinite layers. Specifically, previous works employ implicit differentiation and solve for the exact gradient in the backward pass. However, is it necessary to compute such an exact but expensive gradient for training? In this work, we propose a novel gradient estimate for implicit models, named phantom gradient, that 1) forgoes the costly computation of the exact gradient; and 2) provides an update direction empirically preferable for implicit model training. We theoretically analyze the condition under which an ascent direction of the loss landscape can be found, and provide two specific instantiations of the phantom gradient based on damped unrolling and the Neumann series. Experiments on large-scale tasks demonstrate that these lightweight phantom gradients significantly accelerate the backward passes in training implicit models by roughly 1.7 times, and even boost the performance over approaches based on the exact gradient on ImageNet.
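    A minimal numpy sketch of the Neumann-series idea on a toy implicit layer: the exact implicit gradient requires a linear solve against (I - J), which a truncated (optionally damped) series approximates. The tanh fixed-point model and all constants are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Toy implicit layer: z* = tanh(W z* + U x). The exact gradient dz*/dx is
# (I - J)^{-1} (df/dx) with J = df/dz at the fixed point; the phantom
# gradient truncates the Neumann series sum_k (lam*J)^k after K terms.

rng = np.random.default_rng(0)
d = 5
W = 0.1 * rng.standard_normal((d, d))    # small norm so the fixed point exists
U = rng.standard_normal((d, d))
x = rng.standard_normal(d)

z = np.zeros(d)
for _ in range(100):                     # fixed-point iteration to find z*
    z = np.tanh(W @ z + U @ x)

s = 1.0 - np.tanh(W @ z + U @ x) ** 2    # tanh'(pre-activation) at z*
J = s[:, None] * W                       # Jacobian df/dz at z*
dfdx = s[:, None] * U                    # df/dx at z*

exact = np.linalg.solve(np.eye(d) - J, dfdx)

def phantom(K, lam=1.0):
    """Truncated (optionally damped) Neumann series for (I - lam*J)^{-1}."""
    acc, term = np.eye(d), np.eye(d)
    for _ in range(K - 1):
        term = term @ (lam * J)
        acc += term
    return acc @ dfdx

for K in (1, 3, 10):
    err = np.linalg.norm(phantom(K) - exact) / np.linalg.norm(exact)
    print(f"K={K:2d}  relative error {err:.3e}")
```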
    Interacting with Explanations through Critiquing. (arXiv:2005.11067v4 [cs.CL] UPDATED)
    (2 min) Using personalized explanations to support recommendations has been shown to increase trust and perceived quality. However, to actually obtain better recommendations, there needs to be a means for users to modify the recommendation criteria by interacting with the explanation. We present a novel technique using aspect markers that learns to generate personalized explanations of recommendations from review texts, and we show that human users significantly prefer these explanations over those produced by state-of-the-art techniques. Our work's most important innovation is that it allows users to react to a recommendation by critiquing the textual explanation: removing (symmetrically adding) certain aspects they dislike or that are no longer relevant (symmetrically that are of interest). The system updates its user model and the resulting recommendations according to the critique. This is based on a novel unsupervised critiquing method for single- and multi-step critiquing with textual explanations. Experiments on two real-world datasets show that our system is the first to achieve good performance in adapting to the preferences expressed in multi-step critiquing.
    Dynamical Audio-Visual Navigation: Catching Unheard Moving Sound Sources in Unmapped 3D Environments. (arXiv:2201.04279v1 [cs.CV])
    (2 min) Recent work on audio-visual navigation targets a single static sound in noise-free audio environments and struggles to generalize to unheard sounds. We introduce the novel dynamic audio-visual navigation benchmark in which an embodied AI agent must catch a moving sound source in an unmapped environment in the presence of distractors and noisy sounds. We propose an end-to-end reinforcement learning approach that relies on a multi-modal architecture that fuses the spatial audio-visual information from a binaural audio signal and spatial occupancy maps to encode the features needed to learn a robust navigation policy for our new complex task settings. We demonstrate that our approach outperforms the current state-of-the-art with better generalization to unheard sounds and better robustness to noisy scenarios on the two challenging 3D scanned real-world datasets Replica and Matterport3D, for the static and dynamic audio-visual navigation benchmarks. Our novel benchmark will be made available at this http URL
    Predicting Terrorist Attacks in the United States using Localized News Data. (arXiv:2201.04292v1 [cs.LG])
    (2 min) Dozens of terrorist attacks are perpetrated in the United States every year, often causing fatalities and other significant damage. Toward the end of better understanding and mitigating these attacks, we present a set of machine learning models that learn from localized news data in order to predict whether a terrorist attack will occur on a given calendar date and in a given state. The best model--a Random Forest that learns from a novel variable-length moving average representation of the feature space--achieves area under the receiver operating characteristic scores $> .667$ on four of the five states that were impacted most by terrorism between 2015 and 2018. Our key findings include that modeling terrorism as a set of independent events, rather than as a continuous process, is a fruitful approach--especially when the events are sparse and dissimilar. Additionally, our results highlight the need for localized models that account for differences between locations. From a machine learning perspective, we found that the Random Forest model outperformed several deep models on our multimodal, noisy, and imbalanced data set, thus demonstrating the efficacy of our novel feature representation method in such a context. We also show that its predictions are relatively robust to time gaps between attacks and observed characteristics of the attacks. Finally, we analyze factors that limit model performance, which include a noisy feature space and small amount of available data. These contributions provide an important foundation for the use of machine learning in efforts against terrorism in the United States and beyond.
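    A hedged sketch of the general recipe described above, with synthetic stand-ins for the daily news-derived features, fixed windows in place of the paper's learned variable-length moving averages, and illustrative rare-event labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
days, n_feats = 1_000, 8
news = rng.standard_normal((days, n_feats))          # daily news-derived features
labels = (rng.random(days) < 0.05).astype(int)       # rare-event indicator

def moving_average_features(stream, windows=(3, 7, 30)):
    """Summarize each date by moving averages over several window lengths."""
    rows = []
    for t in range(max(windows), len(stream)):
        rows.append(np.concatenate(
            [stream[t - w:t].mean(axis=0) for w in windows]))
    return np.array(rows)

X = moving_average_features(news)
y = labels[max((3, 7, 30)):]

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[:800], y[:800])
print("held-out mean predicted risk:", clf.predict_proba(X[800:])[:, 1].mean())
```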
    Fighting Money-Laundering with Statistics and Machine Learning: An Introduction and Review. (arXiv:2201.04207v1 [stat.ML])
    (2 min) Money laundering is a profound, global problem. Nonetheless, there is little statistical and machine learning research on the topic. In this paper, we focus on anti-money laundering in banks. To help organize existing research in the field, we propose a unifying terminology and provide a review of the literature. This is structured around two central tasks: (i) client risk profiling and (ii) suspicious behavior flagging. We find that client risk profiling is characterized by diagnostics, i.e., efforts to find and explain risk factors. Suspicious behavior flagging, on the other hand, is characterized by non-disclosed features and hand-crafted risk indices. Finally, we discuss directions for future research. One major challenge is the lack of public data sets. This may, potentially, be addressed by synthetic data generation. Other possible research directions include semi-supervised and deep learning, interpretability and fairness of the results.
    Learning Robust Policies for Generalized Debris Capture with an Automated Tether-Net System. (arXiv:2201.04180v1 [cs.RO])
    (2 min) Tether-net launched from a chaser spacecraft provides a promising method to capture and dispose of large space debris in orbit. This tether-net system is subject to several sources of uncertainty in sensing and actuation that affect the performance of its net launch and closing control. Earlier reliability-based optimization approaches to designing control actions, however, remain challenging and computationally prohibitive to generalize over varying launch scenarios and target (debris) states relative to the chaser. To search for a general and reliable control policy, this paper presents a reinforcement learning framework that integrates a proximal policy optimization (PPO2) approach with net dynamics simulations. The latter allows evaluating episodes of net-based target capture and estimating the capture quality index that serves as the reward feedback to PPO2. Here, the learned policy is designed to model the timing of the net closing action based on the state of the moving net and the target, under any given launch scenario. A stochastic state transition model is considered in order to incorporate synthetic uncertainties in state estimation and launch actuation. Along with notable reward improvement during training, the trained policy demonstrates capture performance (over a wide range of launch/target scenarios) that is close to that obtained with reliability-based optimization run over an individual scenario.
    Robust Contrastive Learning against Noisy Views. (arXiv:2201.04309v1 [cs.CV])
    (2 min) Contrastive learning relies on an assumption that positive pairs contain related views, e.g., patches of an image or co-occurring multimodal signals of a video, that share certain underlying information about an instance. But what if this assumption is violated? The literature suggests that contrastive learning produces suboptimal representations in the presence of noisy views, e.g., false positive pairs with no apparent shared information. In this work, we propose a new contrastive loss function that is robust against noisy views. We provide rigorous theoretical justifications by showing connections to robust symmetric losses for noisy binary classification and by establishing a new contrastive bound for mutual information maximization based on the Wasserstein distance measure. The proposed loss is completely modality-agnostic and a simple drop-in replacement for the InfoNCE loss, which makes it easy to apply to existing contrastive frameworks. We show that our approach provides consistent improvements over the state-of-the-art on image, video, and graph contrastive learning benchmarks that exhibit a variety of real-world noise patterns.
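    For context, since the proposed loss is described as a drop-in replacement for InfoNCE, here is a minimal InfoNCE implementation (the paper's own robust loss is not reproduced here); the temperature and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """InfoNCE over a batch of paired views: positives sit on the diagonal
    of the similarity matrix, every other pair acts as a negative."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau
    targets = torch.arange(z1.size(0))
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(64, 128), torch.randn(64, 128)
print(info_nce(z1, z2).item())
```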
    Improving Disentangled Text Representation Learning with Information-Theoretic Guidance. (arXiv:2006.00693v3 [cs.LG] UPDATED)
    (2 min) Learning disentangled representations of natural language is essential for many NLP tasks, e.g., conditional text generation, style transfer, personalized dialogue systems, etc. Similar problems have been studied extensively for other forms of data, such as images and videos. However, the discrete nature of natural language makes the disentangling of textual representations more challenging (e.g., the manipulation over the data space cannot be easily achieved). Inspired by information theory, we propose a novel method that effectively manifests disentangled representations of text, without any supervision on semantics. A new mutual information upper bound is derived and leveraged to measure dependence between style and content. By minimizing this upper bound, the proposed method induces style and content embeddings into two independent low-dimensional spaces. Experiments on both conditional text generation and text-style transfer demonstrate the high quality of our disentangled representation in terms of content and style preservation.
    Optimal Fixed-Budget Best Arm Identification using the Augmented Inverse Probability Estimator in Two-Armed Gaussian Bandits with Unknown Variances. (arXiv:2201.04469v1 [stat.ML])
    (2 min) We consider the fixed-budget best arm identification problem in two-armed Gaussian bandits with unknown variances. The tightest lower bound on the complexity and an algorithm whose performance guarantee matches the lower bound have long been open problems when the variances are unknown and when the algorithm is agnostic to the optimal proportion of the arm draws. In this paper, we propose a strategy comprising a sampling rule with randomized sampling (RS) following the estimated target allocation probabilities of arm draws and a recommendation rule using the augmented inverse probability weighting (AIPW) estimator, which is often used in the causal inference literature. We refer to our strategy as the RS-AIPW strategy. In the theoretical analysis, we first derive a large deviation principle for martingales, which can be used when the second moment converges in mean, and apply it to our proposed strategy. Then, we show that the proposed strategy is asymptotically optimal in the sense that the probability of misidentification achieves the lower bound by Kaufmann et al. (2016) when the sample size becomes infinitely large and the gap between the two arms goes to zero.
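    A minimal sketch of the AIPW ingredient only, assuming fixed sampling probabilities and a crude plug-in mean model built from past rounds (the paper's RS rule additionally estimates the target allocation from the unknown variances):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5_000
true_means = np.array([0.0, 0.2])

p1 = np.full(T, 0.5)                     # prob. of drawing arm 1 each round
arms = rng.binomial(1, p1)
rewards = rng.normal(true_means[arms], 1.0)

def aipw_mean(a):
    """AIPW (doubly robust) estimate of arm a's mean reward."""
    p_a = p1 if a == 1 else 1.0 - p1            # propensity of arm a
    obs = (arms == a)
    # crude plug-in estimate of arm a's mean, using only past rounds
    cum = np.cumsum(np.where(obs, rewards, 0.0))
    cnt = np.cumsum(obs)
    mu_hat = np.concatenate(([0.0], (cum / np.maximum(cnt, 1))[:-1]))
    return np.mean(obs * (rewards - mu_hat) / p_a + mu_hat)

print("AIPW estimates:", aipw_mean(0), aipw_mean(1))
print("recommend arm:", int(aipw_mean(1) > aipw_mean(0)))
```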
    Learning to Identify Top Elo Ratings: A Dueling Bandits Approach. (arXiv:2201.04480v1 [cs.LG])
    (2 min) The Elo rating system is widely adopted to evaluate the skills of (chess) game and sports players. Recently it has also been integrated into machine learning algorithms for evaluating the performance of computerised AI agents. However, an accurate estimation of the Elo rating (for the top players) often requires many rounds of competitions, which can be expensive to carry out. In this paper, to improve the sample efficiency of the Elo evaluation (for top players), we propose an efficient online match scheduling algorithm. Specifically, we identify and match the top players through a dueling bandits framework and tailor the bandit algorithm to the gradient-based update of Elo. We show that it reduces the per-step memory and time complexity to constant, compared to the traditional likelihood maximization approaches requiring $O(t)$ time. Our algorithm has a regret guarantee of $\tilde{O}(\sqrt{T})$, sublinear in the number of competition rounds, and has been extended to multidimensional Elo ratings for handling intransitive games. We empirically demonstrate that our method achieves superior convergence speed and time efficiency on a variety of gaming tasks.
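    For reference, the gradient-style Elo update such a scheduler builds on, using the conventional constants (K = 32, logistic scale 400) rather than anything paper-specific:

```python
def expected_score(r_a, r_b):
    """Logistic win probability of player A under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, a_won, K=32.0):
    """Online update after one duel; O(1) time and memory per step."""
    e = expected_score(r_a, r_b)
    delta = K * (float(a_won) - e)       # prediction-error (gradient) step
    return r_a + delta, r_b - delta

r1, r2 = 1500.0, 1500.0
for outcome in [1, 1, 0, 1]:             # player 1 wins 3 of 4 duels
    r1, r2 = elo_update(r1, r2, outcome)
print(r1, r2)
```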
    Leveraging Unlabeled Data to Predict Out-of-Distribution Performance. (arXiv:2201.04234v1 [cs.LG])
    (2 min) Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions that may cause performance drops. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold. ATC outperforms previous methods across several model architectures, types of distribution shifts (e.g., due to synthetic corruptions, dataset reproduction, or novel subpopulations), and datasets (Wilds, ImageNet, Breeds, CIFAR, and MNIST). In our experiments, ATC estimates target performance $2$-$4\times$ more accurately than prior methods. We also explore the theoretical foundations of the problem, proving that, in general, identifying the accuracy is just as hard as identifying the optimal predictor and thus, the efficacy of any method rests upon (perhaps unstated) assumptions on the nature of the shift. Finally, analyzing our method on some toy distributions, we provide insights concerning when it works.
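    The core of ATC, as described above, fits in a few lines; the sketch below uses synthetic confidences and correctness labels as placeholders for a real model's validation outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend source validation set: max-softmax confidences and correctness.
src_conf = rng.beta(5, 2, size=2_000)
src_correct = rng.random(2_000) < src_conf          # confident => more often right

# Learn t so that the fraction of source points with confidence above t
# matches the source accuracy.
src_acc = src_correct.mean()
t = np.quantile(src_conf, 1.0 - src_acc)

# Unlabeled, shifted target set: lower confidences overall.
tgt_conf = rng.beta(4, 3, size=2_000)
predicted_target_acc = (tgt_conf > t).mean()
print(f"threshold={t:.3f}  predicted target accuracy={predicted_target_acc:.3f}")
```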
    Embedding Principle of Loss Landscape of Deep Neural Networks. (arXiv:2105.14573v3 [cs.LG] UPDATED)
    (2 min) Understanding the structure of the loss landscape of deep neural networks (DNNs) is obviously important. In this work, we prove an embedding principle that the loss landscape of a DNN "contains" all the critical points of all the narrower DNNs. More precisely, we propose a critical embedding such that any critical point, e.g., local or global minima, of a narrower DNN can be embedded into a critical point/hyperplane of the target DNN with higher degeneracy while preserving the DNN output function. The embedding structure of critical points is independent of loss function and training data, showing a stark difference from other nonconvex problems such as protein-folding. Empirically, we find that a wide DNN is often attracted by highly-degenerate critical points that are embedded from narrow DNNs. The embedding principle provides an explanation for the general easy optimization of wide DNNs and unravels a potential implicit low-complexity regularization during the training. Overall, our work provides a skeleton for the study of the loss landscape of DNNs and its implications, by which a more exact and comprehensive understanding can be anticipated in the near future.
    Optimizing Prediction of MGMT Promoter Methylation from MRI Scans using Adversarial Learning. (arXiv:2201.04416v1 [eess.IV])
    (2 min) Glioblastoma Multiforme (GBM) is a malignant brain cancer forming around 48% of all brain and Central Nervous System (CNS) cancers. It is estimated that annually over 13,000 deaths occur in the US due to GBM, making it crucial to have early diagnosis systems that can lead to predictable and effective treatment. The most common treatment after GBM diagnosis is chemotherapy, which works by driving rapidly dividing cells to apoptosis. However, this form of treatment is not effective when the MGMT promoter sequence is methylated, and instead leads to severe side effects decreasing patient survivability. Therefore, it is important to be able to identify the MGMT promoter methylation status through non-invasive magnetic resonance imaging (MRI) based machine learning (ML) models. This is accomplished using the Brain Tumor Segmentation (BraTS) 2021 dataset, which was recently used for an international Kaggle competition. We developed four primary models - two radiomic models and two CNN models - each solving the binary classification task with progressive improvements. We built a novel ML model, termed the Intermediate State Generator, which was used to normalize the slice thicknesses of all MRI scans. With further improvements, our best model was able to achieve performance significantly ($p < 0.05$) better than the best performing Kaggle model, with a 6% increase in average cross-validation accuracy. This improvement could potentially lead to a more informed choice of chemotherapy as a treatment option, prolonging lives of thousands of patients with GBM each year.
    Fine-grained Graph Learning for Multi-view Subspace Clustering. (arXiv:2201.04604v1 [cs.LG])
    (2 min) Multi-view subspace clustering has conventionally focused on integrating heterogeneous feature descriptions to capture higher-dimensional information. One popular strategy is to generate a common subspace from different views and then apply graph-based approaches to deal with clustering. However, the performance of these methods is still subject to two limitations, namely the multiple-views fusion pattern and the connection between the fusion process and clustering tasks. To address these problems, we propose a novel multi-view subspace clustering framework via fine-grained graph learning, which can assess the consistency of local structures across different views and integrate all views more delicately than previous weight regularizations. Different from other models in the literature, the point-level graph regularization and the reformulation of spectral clustering are introduced to perform graph fusion and learn the shared cluster structure together. Extensive experiments on five real-world datasets show that the proposed framework has comparable performance to the SOTA algorithms.
    Dyna-T: Dyna-Q and Upper Confidence Bounds Applied to Trees. (arXiv:2201.04502v1 [cs.LG])
    (2 min) In this work we present a preliminary investigation of a novel algorithm called Dyna-T. In reinforcement learning (RL) a planning agent has its own representation of the environment as a model. To discover an optimal policy to interact with the environment, the agent collects experience in a trial-and-error fashion. Experience can be used to learn a better model or to directly improve the value function and policy. While these two processes are typically separate, Dyna-Q is a hybrid approach which, at each iteration, exploits the real experience to update the model as well as the value function, while planning its actions using simulated data from its model. However, the planning process is computationally expensive and strongly depends on the dimensionality of the state-action space. We propose to build an Upper Confidence Tree (UCT) on the simulated experience and search for the best action to be selected during the online learning process. We demonstrate the effectiveness of our proposed method in a set of preliminary tests on three testbed environments from OpenAI. In contrast to Dyna-Q, Dyna-T outperforms state-of-the-art RL agents in the stochastic environments by choosing a more robust action selection strategy.
    Multi-task Joint Strategies of Self-supervised Representation Learning on Biomedical Networks for Drug Discovery. (arXiv:2201.04437v1 [cs.LG])
    (2 min) Self-supervised representation learning (SSL) on biomedical networks provides new opportunities for drug discovery, which lacks available biological or clinical phenotypes. However, how to effectively combine multiple SSL models is challenging and rarely explored. Therefore, we propose multi-task joint strategies of self-supervised representation learning on biomedical networks for drug discovery, named MSSL2drug. We design six basic SSL tasks inspired by various modality features, including structures, semantics, and attributes in biomedical heterogeneous networks. In addition, fifteen combinations of multiple tasks are evaluated by a graph-attention-based adversarial multi-task learning framework in two drug discovery scenarios. The results suggest two important findings. (1) Combinations of multimodal tasks achieve the best performance compared to other multi-task joint strategies. (2) The joint training of local and global SSL tasks yields higher performance than random task combinations. Therefore, we conjecture that the multimodal and local-global combination strategies can be regarded as guidelines for multi-task SSL in drug discovery.
    Multiview Transformers for Video Recognition. (arXiv:2201.04288v1 [cs.CV])
    (2 min) Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations. Although transformer architectures have recently advanced the state-of-the-art, they have not explicitly modelled different spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views. We present thorough ablation studies of our model and show that MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes. Furthermore, we achieve state-of-the-art results on five standard datasets, and improve even further with large-scale pretraining. We will release code and pretrained checkpoints.
    Preventing Manifold Intrusion with Locality: Local Mixup. (arXiv:2201.04368v1 [cs.LG])
    (2 min) Mixup is a data-dependent regularization technique that consists of linearly interpolating input samples and associated outputs. It has been shown to improve accuracy when used to train on standard machine learning datasets. However, authors have pointed out that Mixup can produce out-of-distribution virtual samples and even contradictions in the augmented training set, potentially resulting in adversarial effects. In this paper, we introduce Local Mixup, in which distant input samples are weighted down when computing the loss. In constrained settings we demonstrate that Local Mixup can create a trade-off between bias and variance, with the extreme cases reducing to vanilla training and classical Mixup. Using standardized computer vision benchmarks, we also show that Local Mixup can improve test accuracy.
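    A minimal PyTorch sketch of the idea, assuming a Gaussian kernel on input distance as the down-weighting scheme (the paper may weight differently); the toy model and bandwidth sigma are illustrative:

```python
import torch
import torch.nn.functional as F

def local_mixup_loss(model, x, y, alpha=1.0, sigma=1.0):
    """Standard mixup pairs, but each virtual sample's loss is down-weighted
    by the distance between the two inputs that were mixed."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x2, y2 = x[perm], y[perm]

    x_mix = lam * x + (1 - lam) * x2
    logits = model(x_mix)

    dist = (x - x2).flatten(1).norm(dim=1)      # distance of the mixed pair
    w = torch.exp(-(dist / sigma) ** 2)         # distant pairs count less

    loss = lam * F.cross_entropy(logits, y, reduction="none") \
         + (1 - lam) * F.cross_entropy(logits, y2, reduction="none")
    return (w * loss).mean()

model = torch.nn.Linear(20, 5)
x, y = torch.randn(32, 20), torch.randint(0, 5, (32,))
print(local_mixup_loss(model, x, y).item())
```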
    Towards Adversarially Robust Deep Image Denoising. (arXiv:2201.04397v1 [eess.IV])
    (2 min) This work systematically investigates the adversarial robustness of deep image denoisers (DIDs), i.e., how well DIDs can recover the ground truth from noisy observations degraded by adversarial perturbations. Firstly, to evaluate DIDs' robustness, we propose a novel adversarial attack, namely Observation-based Zero-mean Attack ({\sc ObsAtk}), to craft adversarial zero-mean perturbations on given noisy images. We find that existing DIDs are vulnerable to the adversarial noise generated by {\sc ObsAtk}. Secondly, to robustify DIDs, we propose an adversarial training strategy, hybrid adversarial training ({\sc HAT}), that jointly trains DIDs with adversarial and non-adversarial noisy data to ensure that the reconstruction quality is high and the denoisers around non-adversarial data are locally smooth. The resultant DIDs can effectively remove various types of synthetic and adversarial noise. We also uncover that the robustness of DIDs benefits their generalization capability on unseen real-world noise. Indeed, {\sc HAT}-trained DIDs can recover high-quality clean images from real-world noise even without training on real noisy data. Extensive experiments on benchmark datasets, including Set68, PolyU, and SIDD, corroborate the effectiveness of {\sc ObsAtk} and {\sc HAT}.
    On the Statistical Complexity of Sample Amplification. (arXiv:2201.04315v1 [math.ST])
    (2 min) Given $n$ i.i.d. samples drawn from an unknown distribution $P$, when is it possible to produce a larger set of $n+m$ samples which cannot be distinguished from $n+m$ i.i.d. samples drawn from $P$? (Axelrod et al. 2019) formalized this question as the sample amplification problem, and gave optimal amplification procedures for discrete distributions and Gaussian location models. However, these procedures and associated lower bounds are tailored to the specific distribution classes, and a general statistical understanding of sample amplification is still largely missing. In this work, we place the sample amplification problem on a firm statistical foundation by deriving generally applicable amplification procedures, lower bound techniques and connections to existing statistical notions. Our techniques apply to a large class of distributions including the exponential family, and establish a rigorous connection between sample amplification and distribution learning.
    Brain Signals Analysis Based Deep Learning Methods: Recent advances in the study of non-invasive brain signals. (arXiv:2201.04229v1 [q-bio.NC])
    (2 min) Brain signals constitute the information processed by millions of brain neurons (nerve cells and brain cells). These brain signals can be recorded and analyzed using a variety of non-invasive techniques such as electroencephalography (EEG) and magnetoencephalography (MEG), as well as brain-imaging techniques such as Magnetic Resonance Imaging (MRI), Computed Tomography (CT) and others, which are discussed briefly in this paper. This paper discusses currently emerging techniques, such as the usage of different Deep Learning (DL) algorithms, for the analysis of these brain signals and how these algorithms can help determine the neurological status of a person by applying signal-decoding strategies.
    ECONet: Efficient Convolutional Online Likelihood Network for Scribble-based Interactive Segmentation. (arXiv:2201.04584v1 [eess.IV])
    (2 min) Automatic segmentation of lung lesions associated with COVID-19 in CT images requires a large amount of annotated volumes. Annotations mandate expert knowledge and are time-intensive to obtain through fully manual segmentation methods. Additionally, lung lesions have large inter-patient variations, with some pathologies having similar visual appearance as healthy lung tissues. This poses a challenge when applying existing semi-automatic interactive segmentation techniques for data labelling. To address these challenges, we propose an efficient convolutional neural network (CNN) that can be learned online while the annotator provides scribble-based interaction. To accelerate learning from only the samples labelled through user-interactions, a patch-based approach is used for training the network. Moreover, we use weighted cross-entropy loss to address the class imbalance that may result from user-interactions. During online inference, the learned network is applied to the whole input volume using a fully convolutional approach. We compare our proposed method with the state-of-the-art and show that it outperforms existing methods on the task of annotating lung lesions associated with COVID-19, achieving a 16% higher Dice score while reducing execution time by 3$\times$ and requiring 9000 fewer scribble-labelled voxels. Due to the online learning aspect, our approach adapts quickly to user input, resulting in high quality segmentation labels. Source code will be made available upon acceptance.
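    A small sketch of the class-weighted cross-entropy ingredient, assuming inverse-frequency weights computed from the scribble labels in each batch (one common choice, not necessarily the paper's exact weighting):

```python
import torch
import torch.nn.functional as F

def balanced_ce(logits, labels, num_classes=2):
    """Cross-entropy with weights inversely proportional to class frequency,
    countering the imbalance of sparse scribble annotations."""
    counts = torch.bincount(labels.flatten(), minlength=num_classes).float()
    weights = counts.sum() / (num_classes * counts.clamp(min=1))
    return F.cross_entropy(logits, labels, weight=weights)

logits = torch.randn(4, 2, 16, 16, 16)              # batch of 3D patches
labels = (torch.rand(4, 16, 16, 16) > 0.9).long()   # sparse "lesion" voxels
print(balanced_ce(logits, labels).item())
```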
    SmartDet: Context-Aware Dynamic Control of Edge Task Offloading for Mobile Object Detection. (arXiv:2201.04235v1 [cs.DC])
    (2 min) Mobile devices increasingly rely on object detection (OD) through deep neural networks (DNNs) to perform critical tasks. Due to their high complexity, the execution of these DNNs requires excessive time and energy. Low-complexity object tracking (OT) can be used with OD, where the latter is periodically applied to generate "fresh" references for tracking. However, the frames processed with OD incur large delays, which may make the reference outdated and degrade tracking quality. Herein, we propose to use edge computing in this context, and establish parallel OT (at the mobile device) and OD (at the edge server) processes that are resilient to large OD latency. We propose Katch-Up, a novel tracking mechanism that improves the system resilience to excessive OD delay. However, while Katch-Up significantly improves performance, it also increases the computing load of the mobile device. Hence, we design SmartDet, a low-complexity controller based on deep reinforcement learning (DRL) that learns to control the trade-off between resource utilization and OD performance. SmartDet takes as input contextual information about the current video content and the current network conditions to optimize the frequency and type of OD offloading, as well as Katch-Up utilization. We extensively evaluate SmartDet on a real-world testbed composed of a Jetson Nano as mobile device and a GTX 980 Ti as edge server, connected through a Wi-Fi link. Experimental results show that SmartDet achieves an optimal balance between tracking performance, measured as mean Average Recall (mAR), and resource usage. With respect to a baseline with full Katch-Up usage and maximum channel usage, we still increase mAR by 4% while using 50% less of the channel and 30% less of the power resources associated with Katch-Up. With respect to a fixed strategy using minimal resources, we increase mAR by 20% while using Katch-Up on 1/3 of the frames.
    Exact learning and test theory. (arXiv:2201.04506v1 [cs.CC])
    (2 min) In this paper, based on results of exact learning and test theory, we study arbitrary infinite binary information systems, each of which consists of an infinite set of elements and an infinite set of two-valued functions (attributes) defined on the set of elements. We consider the notion of a problem over an information system, which is described by a finite number of attributes: for a given element, we should recognize the values of these attributes. As algorithms for problem solving, we consider decision trees of two types: (i) using only proper hypotheses (an analog of proper equivalence queries from exact learning), and (ii) using both attributes and proper hypotheses. As time complexity, we study the depth of decision trees. In the worst case, with the growth of the number of attributes in the problem description, the minimum depth of decision trees of both types is either bounded from above by a constant or grows logarithmically or linearly. Based on these results and results obtained earlier for attributes and arbitrary hypotheses, we divide the set of all infinite binary information systems into seven complexity classes.
    De-Noising of Photoacoustic Microscopy Images by Deep Learning. (arXiv:2201.04302v1 [eess.IV])
    (2 min) As a hybrid imaging technology, photoacoustic microscopy (PAM) imaging suffers from noise due to the maximum permissible exposure of laser intensity, attenuation of ultrasound in the tissue, and the inherent noise of the transducer. De-noising is a post-processing method that reduces noise so that PAM image quality can be recovered. However, previous de-noising techniques usually rely heavily on mathematical priors as well as manually selected parameters, resulting in unsatisfactory and slow de-noising performance for different noisy images, which greatly hinders practical and clinical applications. In this work, we propose a deep learning-based method to remove complex noise from PAM images without mathematical priors and manual selection of settings for different input images. An attention-enhanced generative adversarial network is used to extract image features and remove various noises. The proposed method is demonstrated on both synthetic and real datasets, including phantom (leaf veins) and in vivo (mouse ear blood vessels and zebrafish pigment) experiments. The results show that compared with previous PAM de-noising methods, our method exhibits good performance in recovering images qualitatively and quantitatively. In addition, a de-noising speed of 0.016 s is achieved for an image with $256\times256$ pixels. Our approach is effective and practical for the de-noising of PAM images.
    Deep Symbolic Regression for Recurrent Sequences. (arXiv:2201.04600v1 [cs.LG])
    (2 min) Symbolic regression, i.e. predicting a function from the observation of its values, is well-known to be a challenging task. In this paper, we train Transformers to infer the function or recurrence relation underlying sequences of integers or floats, a typical task in human IQ tests which has hardly been tackled in the machine learning literature. We evaluate our integer model on a subset of OEIS sequences, and show that it outperforms built-in Mathematica functions for recurrence prediction. We also demonstrate that our float model is able to yield informative approximations of out-of-vocabulary functions and constants, e.g. $\operatorname{bessel0}(x)\approx \frac{\sin(x)+\cos(x)}{\sqrt{\pi x}}$ and $1.644934\approx \pi^2/6$. An interactive demonstration of our models is provided at https://bit.ly/3niE5FS.
    An Efficient and Adaptive Granular-ball Generation Method in Classification Problem. (arXiv:2201.04343v1 [cs.LG])
    (2 min) Granular-ball computing is an efficient, robust, and scalable learning method for granular computing. The basis of granular-ball computing is the granular-ball generation method. This paper proposes a method for accelerating granular-ball generation by using division to replace $k$-means. It can greatly improve the efficiency of granular-ball generation while ensuring accuracy similar to the existing method. Besides, a new adaptive method for granular-ball generation is proposed by considering the elimination of granular-ball overlap and some other factors. This makes the granular-ball generation process parameter-free and completely adaptive in the true sense. In addition, this paper provides the first mathematical models for granular-ball covering. The experimental results on some real data sets demonstrate that the two proposed granular-ball generation methods have accuracies similar to the existing method while achieving adaptiveness or acceleration.
    Benchmarking Energy-Conserving Neural Networks for Learning Dynamics from Data. (arXiv:2012.02334v5 [cs.LG] UPDATED)
    (2 min) The last few years have witnessed an increased interest in incorporating physics-informed inductive bias in deep learning frameworks. In particular, a growing volume of literature has been exploring ways to enforce energy conservation while using neural networks for learning dynamics from observed time-series data. In this work, we survey ten recently proposed energy-conserving neural network models, including HNN, LNN, DeLaN, SymODEN, CHNN, CLNN and their variants. We provide a compact derivation of the theory behind these models and explain their similarities and differences. Their performance is compared on four physical systems. We point out the possibility of leveraging some of these energy-conserving models to design energy-based controllers.
    UniTTS: Residual Learning of Unified Embedding Space for Speech Style Control. (arXiv:2106.11171v2 [eess.AS] UPDATED)
    (2 min) We propose a novel high-fidelity expressive speech synthesis model, UniTTS, that learns and controls overlapping style attributes while avoiding interference. UniTTS represents multiple style attributes in a single unified embedding space by the residuals between the phoneme embeddings before and after applying the attributes. The proposed method is especially effective in controlling multiple attributes that are difficult to separate cleanly, such as speaker ID and emotion, because it minimizes redundancy when adding variance in speaker ID and emotion, and additionally predicts duration, pitch, and energy based on the speaker ID and emotion. In experiments, the visualization results show that the proposed method learned multiple attributes harmoniously, in a manner that allows them to be easily separated again. UniTTS also synthesized high-fidelity speech signals while controlling multiple style attributes. The synthesized speech samples are presented at https://jackson-kang.github.io/paper_works/UniTTS/demos.
    Efficient and Interpretable Robot Manipulation with Graph Neural Networks. (arXiv:2102.13177v4 [cs.RO] UPDATED)
    (2 min) Manipulation tasks, like loading a dishwasher, can be seen as a sequence of spatial constraints and relationships between different objects. We aim to discover these rules from demonstrations by posing manipulation as a classification problem over a graph, whose nodes represent task-relevant entities like objects and goals, and present a graph neural network (GNN) policy architecture for solving this problem from demonstrations. In our experiments, a single GNN policy trained using imitation learning (IL) on 20 expert demos can solve blockstacking, rearrangement, and dishwasher loading tasks; once the policy has learned the spatial structure, it can generalize to a larger number of objects, goal configurations, and from simulation to the real world. These experiments show that graphical IL can solve complex long-horizon manipulation problems without requiring detailed task descriptions. Videos can be found at: https://youtu.be/POxaTDAj7aY.
    A semi-agnostic ansatz with variable structure for quantum machine learning. (arXiv:2103.06712v2 [quant-ph] UPDATED)
    (2 min) Quantum machine learning (QML) offers a powerful, flexible paradigm for programming near-term quantum computers, with applications in chemistry, metrology, materials science, data science, and mathematics. Here, one trains an ansatz, in the form of a parameterized quantum circuit, to accomplish a task of interest. However, challenges have recently emerged suggesting that deep ansatzes are difficult to train, due to flat training landscapes caused by randomness or by hardware noise. This motivates our work, where we present a variable structure approach to build ansatzes for QML. Our approach, called VAns (Variable Ansatz), applies a set of rules to both grow and (crucially) remove quantum gates in an informed manner during the optimization. Consequently, VAns is ideally suited to mitigate trainability and noise-related issues by keeping the ansatz shallow. We employ VAns in the variational quantum eigensolver for condensed matter and quantum chemistry applications and also in the quantum autoencoder for data compression, showing successful results in all cases.
    Trajectory-Constrained Deep Latent Visual Attention for Improved Local Planning in Presence of Heterogeneous Terrain. (arXiv:2112.04684v2 [cs.RO] UPDATED)
    (2 min) We present a reward-predictive, model-based deep learning method featuring trajectory-constrained visual attention for use in mapless, local visual navigation tasks. Our method learns to place visual attention at locations in latent image space which follow trajectories caused by vehicle control actions to enhance predictive accuracy during planning. The attention model is jointly optimized by the task-specific loss and an additional trajectory-constraint loss, allowing adaptability yet encouraging a regularized structure for improved generalization and reliability. Importantly, visual attention is applied in latent feature map space instead of raw image space to promote efficient planning. We validated our model in visual navigation tasks of planning low turbulence, collision-free trajectories in off-road settings and hill climbing with locking differentials in the presence of slippery terrain. Experiments involved randomized procedural generated simulation and real-world environments. We found our method improved generalization and learning efficiency when compared to no-attention and self-attention alternatives.
    Center Smoothing: Certified Robustness for Networks with Structured Outputs. (arXiv:2102.09701v3 [cs.LG] UPDATED)
    (2 min) The study of provable adversarial robustness has mostly been limited to classification tasks and models with one-dimensional real-valued outputs. We extend the scope of certifiable robustness to problems with more general and structured outputs like sets, images, language, etc. We model the output space as a metric space under a distance/similarity function, such as intersection-over-union, perceptual similarity, total variation distance, etc. Such models are used in many machine learning problems like image segmentation, object detection, generative models, image/audio-to-text systems, etc. Based on a robustness technique called randomized smoothing, our $\textit{center smoothing}$ procedure can produce models with the guarantee that the change in the output, as measured by the distance metric, remains small for any norm-bounded adversarial perturbation of the input. We apply our method to create certifiably robust models with disparate output spaces - from sets to images - and show that it yields meaningful certificates without significantly degrading the performance of the base model. Code for our experiments is available at: https://github.com/aounon/center-smoothing.
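    A simplified sketch of the smoothing step only, taking the sample with minimal median distance to the others as the "center"; the actual procedure computes a minimum enclosing ball and a certified radius, which this omits. The toy model and sigma are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(x):
    """Stand-in for a network with structured (here: 2-D vector) output."""
    return np.array([np.sin(x).sum(), np.cos(x).sum()])

def center_smooth(x, sigma=0.25, n=200):
    """Run the model on Gaussian perturbations of x and return a central
    output under the chosen output-space distance (Euclidean here)."""
    outs = np.stack([model(x + sigma * rng.standard_normal(x.shape))
                     for _ in range(n)])
    d = np.linalg.norm(outs[:, None, :] - outs[None, :, :], axis=-1)
    center_idx = np.median(d, axis=1).argmin()   # most central sample
    return outs[center_idx]

x = rng.standard_normal(8)
print(center_smooth(x))
```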
    Learning Domain Invariant Representations for Generalizable Person Re-Identification. (arXiv:2103.15890v3 [cs.CV] UPDATED)
    (2 min) Generalizable person Re-Identification (ReID) has attracted growing attention in recent computer vision community. In this work, we construct a structural causal model among identity labels, identity-specific factors (clothes/shoes color etc), and domain-specific factors (background, viewpoints etc). According to the causal analysis, we propose a novel Domain Invariant Representation Learning for generalizable person Re-Identification (DIR-ReID) framework. Specifically, we first propose to disentangle the identity-specific and domain-specific feature spaces, based on which we propose an effective algorithmic implementation for backdoor adjustment, essentially serving as a causal intervention towards the SCM. Extensive experiments have been conducted, showing that DIR-ReID outperforms state-of-the-art methods on large-scale domain generalization ReID benchmarks.
    Analysis of autocorrelation times in Neural Markov Chain Monte Carlo simulations. (arXiv:2111.10189v2 [cond-mat.stat-mech] UPDATED)
    (2 min) We provide a deepened study of autocorrelations in Neural Markov Chain Monte Carlo simulations, a version of the traditional Metropolis algorithm which employs neural networks to provide independent proposals. We illustrate our ideas using the two-dimensional Ising model. We propose several estimates of autocorrelation times, some inspired by analytical results derived for the Metropolized Independent Sampler, which we compare and study as a function of inverse temperature $\beta$. Based on these, we propose an alternative loss function and study its impact on the autocorrelation times. Furthermore, we investigate the impact of imposing system symmetries ($Z_2$ and/or translational) in the neural network training process on the autocorrelation times. Eventually, we propose a scheme which incorporates partial heat-bath updates. The impact of the above enhancements is discussed for a $16 \times 16$ spin system. The summary of our findings may serve as a guide to the implementation of Neural Markov Chain Monte Carlo simulations of more complicated models.
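    For concreteness, a standard estimator of the integrated autocorrelation time with a self-consistent window, checked on an AR(1) chain whose true value is known; this is textbook methodology, not the paper's specific estimators:

```python
import numpy as np

def autocorr_time(chain, c=5.0):
    """tau_int = 1 + 2 * sum_t rho(t), truncated by the standard
    self-consistent window t < c * tau."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    f = np.fft.rfft(x, n=2 * n)                    # zero-padded FFT
    acf = np.fft.irfft(f * np.conj(f))[:n]         # linear autocorrelation
    acf /= acf[0]                                  # rho(0) = 1
    tau = 1.0
    for t in range(1, n):
        tau += 2.0 * acf[t]
        if t >= c * tau:                           # self-consistent cutoff
            break
    return tau

# AR(1) chain with known tau_int = (1 + phi) / (1 - phi) = 19 for phi = 0.9.
rng = np.random.default_rng(0)
phi, x, chain = 0.9, 0.0, []
for _ in range(100_000):
    x = phi * x + rng.standard_normal()
    chain.append(x)
print(f"estimated tau_int: {autocorr_time(chain):.1f}")
```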
    On Calibration and Out-of-domain Generalization. (arXiv:2102.10395v4 [cs.LG] UPDATED)
    (2 min) Out-of-domain (OOD) generalization is a significant challenge for machine learning models. Many techniques have been proposed to overcome this challenge, often focused on learning models with certain invariance properties. In this work, we draw a link between OOD performance and model calibration, arguing that calibration across multiple domains can be viewed as a special case of an invariant representation leading to better OOD generalization. Specifically, we show that under certain conditions, models which achieve \emph{multi-domain calibration} are provably free of spurious correlations. This leads us to propose multi-domain calibration as a measurable and trainable surrogate for the OOD performance of a classifier. We therefore introduce methods that are easy to apply and allow practitioners to improve multi-domain calibration by training or modifying an existing model, leading to better performance on unseen domains. Using four datasets from the recently proposed WILDS OOD benchmark, as well as the Colored MNIST dataset, we demonstrate that training or tuning models so they are calibrated across multiple domains leads to significantly improved performance on unseen test domains. We believe this intriguing connection between calibration and OOD generalization is promising from both a practical and theoretical point of view.
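    A minimal sketch of per-domain expected calibration error (ECE), the kind of quantity one would measure (or regularize) to pursue multi-domain calibration; the binning scheme and synthetic data are illustrative:

```python
import numpy as np

def ece(conf, correct, n_bins=10):
    """Expected calibration error: confidence-weighted gap between
    accuracy and mean confidence within each confidence bin."""
    bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        m = bins == b
        if m.any():
            err += m.mean() * abs(correct[m].mean() - conf[m].mean())
    return err

rng = np.random.default_rng(0)
for domain in ["photos", "sketches"]:
    conf = rng.uniform(0.5, 1.0, 5_000)
    correct = (rng.random(5_000) < conf * 0.9).astype(float)  # overconfident
    print(f"{domain}: ECE = {ece(conf, correct):.3f}")
```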
    Holistic Deep Learning. (arXiv:2110.15829v2 [cs.LG] UPDATED)
    (2 min) There is much interest in deep learning to solve challenges that arise in applying neural network models in real-world environments. In particular, three areas have received considerable attention: adversarial robustness, parameter sparsity, and output stability. Despite numerous attempts at solving these problems independently, there is very little work addressing the challenges simultaneously. In this paper, we address this problem of constructing holistic deep learning models by proposing a novel formulation that solves these issues in combination. Real-world experiments on both tabular datasets and MNIST show that our formulation is able to simultaneously improve the accuracy, robustness, stability, and sparsity over traditional deep learning models, among many others.
    ColO-RAN: Developing Machine Learning-based xApps for Open RAN Closed-loop Control on Programmable Experimental Platforms. (arXiv:2112.09559v2 [cs.NI] UPDATED)
    (2 min) In spite of the new opportunities brought about by the Open RAN, advances in ML-based network automation have been slow, mainly because of the unavailability of large-scale datasets and experimental testing infrastructure. This slows down the development and widespread adoption of Deep Reinforcement Learning (DRL) agents on real networks, delaying progress in intelligent and autonomous RAN control. In this paper, we address these challenges by proposing practical solutions and software pipelines for the design, training, testing, and experimental evaluation of DRL-based closed-loop control in the Open RAN. We introduce ColO-RAN, the first publicly-available large-scale O-RAN testing framework with software-defined radios-in-the-loop. Building on the scale and computational capabilities of the Colosseum wireless network emulator, ColO-RAN enables ML research at scale using O-RAN components, programmable base stations, and a "wireless data factory". Specifically, we design and develop three exemplary xApps for DRL-based control of RAN slicing, scheduling and online model training, and evaluate their performance on a cellular network with 7 softwarized base stations and 42 users. Finally, we showcase the portability of ColO-RAN to different platforms by deploying it on Arena, an indoor programmable testbed. Extensive results from our first-of-its-kind large-scale evaluation highlight the benefits and challenges of DRL-based adaptive control. They also provide insights on the development of wireless DRL pipelines, from data analysis to the design of DRL agents, and on the tradeoffs associated with training on a live RAN. ColO-RAN and the collected large-scale dataset will be made publicly available to the research community.
    Entity-Based Knowledge Conflicts in Question Answering. (arXiv:2109.05052v2 [cs.CL] UPDATED)
    (2 min) Knowledge-dependent tasks typically use two sources of knowledge: parametric, learned at training time, and contextual, given as a passage at inference time. To understand how models use these sources together, we formalize the problem of knowledge conflicts, where the contextual information contradicts the learned information. Analyzing the behaviour of popular models, we measure their over-reliance on memorized information (the cause of hallucinations), and uncover important factors that exacerbate this behaviour. Lastly, we propose a simple method to mitigate over-reliance on parametric knowledge, which minimizes hallucination, and improves out-of-distribution generalization by 4%-7%. Our findings demonstrate the importance for practitioners to evaluate model tendency to hallucinate rather than read, and show that our mitigation strategy encourages generalization to evolving information (i.e., time-dependent queries). To encourage these practices, we have released our framework for generating knowledge conflicts.
    Skin3D: Detection and Longitudinal Tracking of Pigmented Skin Lesions in 3D Total-Body Textured Meshes. (arXiv:2105.00374v2 [cs.CV] UPDATED)
    (2 min) We present an automated approach to detect and longitudinally track skin lesions on 3D total-body skin surface scans. The acquired 3D mesh of the subject is unwrapped to a 2D texture image, where a trained object detection model, Faster R-CNN, localizes the lesions within the 2D domain. These detected skin lesions are mapped back to the 3D surface of the subject and, for subjects imaged multiple times, we construct a graph-based matching procedure to longitudinally track lesions that considers the anatomical correspondences among pairs of meshes, the geodesic proximity of corresponding lesions, and the inter-lesion geodesic distances. We evaluated the proposed approach using 3DBodyTex, a publicly available dataset composed of 3D scans imaging the coloured skin (textured meshes) of 200 human subjects. We manually annotated locations that appeared to the human eye to contain a pigmented skin lesion as well as tracked a subset of lesions occurring on the same subject imaged in different poses. Our results, when compared to three human annotators, suggest that the trained Faster R-CNN detects lesions at a similar performance level as the human annotators. Our lesion tracking algorithm achieves an average matching accuracy of 88% on a set of detected corresponding pairs of prominent lesions of subjects imaged in different poses, and an average longitudinal accuracy of 71% when encompassing additional errors due to lesion detection. As there currently is no other large-scale publicly available dataset of 3D total-body skin lesions, we publicly release over 25,000 3DBodyTex manual annotations, which we hope will further research on total-body skin lesion analysis.
    Curriculum Offline Imitation Learning. (arXiv:2111.02056v2 [cs.LG] UPDATED)
    (2 min) Offline reinforcement learning (RL) tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment. Despite the potential to surpass the behavioral policies, RL-based methods are generally impractical due to training instability and the extrapolation errors introduced by bootstrapping, which typically require careful hyperparameter tuning via online evaluation. In contrast, offline imitation learning (IL) has no such issues since it learns the policy directly without estimating the value function by bootstrapping. However, IL is usually limited by the capability of the behavioral policy and tends to learn a mediocre behavior from a dataset collected by a mixture of policies. In this paper, we aim to take advantage of IL but mitigate such a drawback. Observing that behavior cloning is able to imitate neighboring policies with less data, we propose \textit{Curriculum Offline Imitation Learning (COIL)}, which utilizes an experience picking strategy for imitating from adaptive neighboring policies with a higher return, and improves the current policy along curriculum stages. On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids just learning a mediocre behavior on mixed datasets but is also even competitive with state-of-the-art offline RL methods.
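    The picking idea can be sketched abstractly: among trajectories whose behavior is close to the current policy, imitate those with higher return (the `policy_distance` proxy and data structures below are illustrative assumptions, not the paper's exact criterion):
    ```python
    # Hedged sketch of curriculum experience picking: among trajectories whose
    # behavior is close to the current policy, imitate those with higher return.
    # `policy_distance` is an illustrative proxy, not the paper's exact criterion.
    def pick_neighbors(trajectories, current_return, policy_distance, eps=0.5):
        """trajectories: list of dicts with at least a 'return' key."""
        candidates = [t for t in trajectories
                      if policy_distance(t) < eps and t["return"] > current_return]
        # Imitate the closest qualifying trajectories first, then raise the bar:
        # each curriculum stage repeats this selection with the improved policy.
        return sorted(candidates, key=policy_distance)
    ```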
    Testing Self-Organized Criticality Across the Main Sequence using Stellar Flares from TESS. (arXiv:2109.07011v2 [astro-ph.SR] UPDATED)
    (2 min) Self-organized criticality describes a class of dynamical systems that maintain themselves in an attractor state with no intrinsic length or time scale. Fundamentally, this theoretical construct requires a mechanism for instability that may trigger additional instabilities locally via dissipative processes. This concept has been invoked to explain nonlinear dynamical phenomena such as featureless energy spectra that have been observed empirically for earthquakes, avalanches, and solar flares. If this interpretation proves correct, it implies that the solar coronal magnetic field maintains itself in a critical state via a delicate balance between the dynamo-driven injection of magnetic energy and the release of that energy via flaring events. All-sky high-cadence surveys like the Transiting Exoplanet Survey Satellite (TESS) provide the necessary data to compare the energy distribution of flaring events in stars of different spectral types to that observed in the Sun. We identified $\sim 10^6$ flaring events on $\sim 10^5$ stars observed by TESS at 2-minute cadence. By fitting the flare frequency distribution for different mass bins, we find that all main sequence stars exhibit distributions of flaring events similar to that observed in the Sun, independent of their mass or age. This may suggest that stars universally maintain a critical state in their coronal topologies via magnetic reconnection events. If this interpretation proves correct, we may be able to infer properties of magnetic fields, interior structure, and dynamo mechanisms for stars that are otherwise unresolved point sources.
    Game Theory for Adversarial Attacks and Defenses. (arXiv:2110.06166v3 [cs.LG] UPDATED)
    (2 min) Adversarial attacks can generate adversarial inputs by applying small but intentionally worst-case perturbations to samples from the dataset, which leads to even state-of-the-art deep neural networks outputting incorrect answers with high confidence. Hence, adversarial defense techniques have been developed to improve the security and robustness of the models and prevent them from being attacked. Gradually, a game-like competition between attackers and defenders has formed, in which both players attempt to play their best strategies against each other while maximizing their own payoffs. To solve the game, each player chooses an optimal strategy against the opponent based on the prediction of the opponent's strategy choice. In this work, we take the defensive side and apply game-theoretic approaches to defend against attacks. We use two randomization methods, random initialization and stochastic activation pruning, to create diversity in networks. Furthermore, we use one denoising technique, super resolution, to improve models' robustness by preprocessing images before attacks. Our experimental results indicate that these three methods can effectively improve the robustness of deep neural networks.
    Who Is the Strongest Enemy? Towards Optimal and Efficient Evasion Attacks in Deep RL. (arXiv:2106.05087v3 [cs.LG] UPDATED)
    (2 min) Evaluating the worst-case performance of a reinforcement learning (RL) agent under the strongest/optimal adversarial perturbations on state observations (within some constraints) is crucial for understanding the robustness of RL agents. However, finding the optimal adversary is challenging, in terms of both whether we can find the optimal attack and how efficiently we can find it. Existing works on adversarial RL either use heuristics-based methods that may not find the strongest adversary, or directly train an RL-based adversary by treating the agent as a part of the environment, which can find the optimal adversary but may become intractable in a large state space. This paper introduces a novel attacking method to find the optimal attacks through collaboration between a designed function named "actor" and an RL-based learner named "director". The actor crafts state perturbations for a given policy perturbation direction, and the director learns to propose the best policy perturbation directions. Our proposed algorithm, PA-AD, is theoretically optimal and significantly more efficient than prior RL-based works in environments with large state spaces. Empirical results show that our proposed PA-AD universally outperforms state-of-the-art attacking methods in various Atari and MuJoCo environments. By applying PA-AD to adversarial training, we achieve state-of-the-art empirical robustness in multiple tasks under strong adversaries.
    GraphVAMPNet, using graph neural networks and variational approach to markov processes for dynamical modeling of biomolecules. (arXiv:2201.04609v1 [physics.comp-ph])
    (2 min) Finding low dimensional representation of data from long-timescale trajectories of biomolecular processes such as protein-folding or ligand-receptor binding is of fundamental importance, and kinetic models such as Markov modeling have proven useful in describing the kinetics of these systems. Recently, an unsupervised machine learning technique called VAMPNet was introduced to learn the low dimensional representation and linear dynamical model in an end-to-end manner. VAMPNet is based on the variational approach to Markov processes (VAMP) and relies on neural networks to learn the coarse-grained dynamics. In this contribution, we combine VAMPNet and graph neural networks to generate an end-to-end framework to efficiently learn high-level dynamics and metastable states from the long-timescale molecular dynamics trajectories. This method bears the advantages of graph representation learning and uses graph message passing operations to generate an embedding for each datapoint, which is used in the VAMPNet to generate a coarse-grained representation. This type of molecular representation results in a higher resolution and more interpretable Markov model than the standard VAMPNet, enabling a more detailed kinetic study of the biomolecular processes. Our GraphVAMPNet approach is also enhanced with an attention mechanism to find the important residues for classification into different metastable states.
    Agent-Temporal Attention for Reward Redistribution in Episodic Multi-Agent Reinforcement Learning. (arXiv:2201.04612v1 [cs.MA])
    (2 min) This paper considers multi-agent reinforcement learning (MARL) tasks where agents receive a shared global reward at the end of an episode. The delayed nature of this reward affects the ability of the agents to assess the quality of their actions at intermediate time-steps. This paper focuses on developing methods to learn a temporal redistribution of the episodic reward to obtain a dense reward signal. Solving such MARL problems requires addressing two challenges: identifying (1) relative importance of states along the length of an episode (along time), and (2) relative importance of individual agents' states at any single time-step (among agents). In this paper, we introduce Agent-Temporal Attention for Reward Redistribution in Episodic Multi-Agent Reinforcement Learning (AREL) to address these two challenges. AREL uses attention mechanisms to characterize the influence of actions on state transitions along trajectories (temporal attention), and how each agent is affected by other agents at each time-step (agent attention). The redistributed rewards predicted by AREL are dense, and can be integrated with any given MARL algorithm. We evaluate AREL on challenging tasks from the Particle World environment and the StarCraft Multi-Agent Challenge. AREL results in higher rewards in Particle World, and improved win rates in StarCraft compared to three state-of-the-art reward redistribution methods. Our code is available at https://github.com/baicenxiao/AREL.
  • stat.ML updates on arXiv.org

    Fast Online Changepoint Detection via Functional Pruning CUSUM statistics. (arXiv:2110.08205v2 [stat.ME] UPDATED)
    (2 min) Many modern applications of online changepoint detection require the ability to process high-frequency observations, sometimes with limited available computational resources. Online algorithms for detecting a change in mean often involve using a moving window, or specifying the expected size of change. Such choices affect which changes the algorithms have most power to detect. We introduce an algorithm, Functional Online CuSUM (FOCuS), which is equivalent to running these earlier methods simultaneously for all sizes of window, or all possible values for the size of change. Our theoretical results give tight bounds on the expected computational cost per iteration of FOCuS, with this being logarithmic in the number of observations. We show how FOCuS can be applied to a number of different change in mean scenarios, and demonstrate its practical utility through its state-of-the-art performance at detecting anomalous behaviour in computer server data.
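    For orientation, here is the classical one-sided CUSUM recursion for a pre-specified change size, i.e. the baseline that FOCuS in effect runs for all change sizes at once (a sketch of the standard recursion, not of FOCuS itself):
    ```python
    # Classical one-sided CUSUM for an upward shift in mean of known size `delta`
    # (pre-change mean 0, unit variance assumed). FOCuS removes the need to
    # pre-specify `delta`; this is only the baseline it generalizes.
    def cusum(xs, delta, threshold):
        s = 0.0
        for t, x in enumerate(xs):
            s = max(0.0, s + x - delta / 2.0)  # standard CUSUM recursion
            if s > threshold:
                return t  # change flagged at time t
        return None  # no change detected
    ```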
    Embedding Principle of Loss Landscape of Deep Neural Networks. (arXiv:2105.14573v3 [cs.LG] UPDATED)
    (2 min) Understanding the structure of the loss landscape of deep neural networks (DNNs) is clearly important. In this work, we prove an embedding principle that the loss landscape of a DNN "contains" all the critical points of all the narrower DNNs. More precisely, we propose a critical embedding such that any critical point, e.g., local or global minima, of a narrower DNN can be embedded into a critical point/hyperplane of the target DNN with higher degeneracy and preserving the DNN output function. The embedding structure of critical points is independent of loss function and training data, showing a stark difference from other nonconvex problems such as protein-folding. Empirically, we find that a wide DNN is often attracted by highly-degenerate critical points that are embedded from narrow DNNs. The embedding principle provides an explanation for the general easy optimization of wide DNNs and unravels a potential implicit low-complexity regularization during the training. Overall, our work provides a skeleton for the study of loss landscape of DNNs and its implication, by which a more exact and comprehensive understanding can be anticipated in the near future.
    Manifold learning via quantum dynamics. (arXiv:2112.11161v2 [quant-ph] UPDATED)
    (2 min) We introduce an algorithm for computing geodesics on sampled manifolds that relies on simulation of quantum dynamics on a graph embedding of the sampled data. Our approach exploits classic results in semiclassical analysis and the quantum-classical correspondence, and forms a basis for techniques to learn the manifold from which a dataset is sampled, and subsequently for nonlinear dimensionality reduction of high-dimensional datasets. We illustrate the new algorithm with data sampled from model manifolds and also by a clustering demonstration based on COVID-19 mobility data. Finally, our method reveals interesting connections between the discretization provided by data sampling and quantization.
    When Random Tensors meet Random Matrices. (arXiv:2112.12348v2 [math.PR] UPDATED)
    (2 min) Relying on random matrix theory (RMT), this paper studies asymmetric order-$d$ spiked tensor models with Gaussian noise. Using the variational definition of the singular vectors and values of (Lim, 2005), we show that the analysis of the considered model boils down to the analysis of an equivalent spiked symmetric block-wise random matrix, that is constructed from contractions of the studied tensor with the singular vectors associated to its best rank-1 approximation. Our approach allows the exact characterization of the almost sure asymptotic singular value and alignments of the corresponding singular vectors with the true spike components, when $\frac{n_i}{\sum_{j=1}^d n_j}\to c_i\in [0, 1]$ with $n_i$'s the tensor dimensions. In contrast to other works that rely mostly on tools from statistical physics to study random tensors, our results rely solely on classical RMT tools such as Stein's lemma. Finally, classical RMT results concerning spiked random matrices are recovered as a particular case.
    Test for non-negligible adverse shifts. (arXiv:2107.02990v3 [stat.ML] UPDATED)
    (2 min) Statistical tests for dataset shift are susceptible to false alarms: they are sensitive to minor differences when there is in fact adequate sample coverage and predictive performance. We propose instead a framework to detect adverse dataset shifts based on outlier scores, $\texttt{D-SOS}$ for short. $\texttt{D-SOS}$ holds that the new (test) sample is not substantively worse than the reference (training) sample, and not that the two are equal. The key idea is to reduce observations to outlier scores and compare contamination rates at varying weighted thresholds. Users can define what $\it{worse}$ means in terms of relevant notions of outlyingness, including proxies for predictive performance. Compared to tests of equal distribution, our approach is uniquely tailored to serve as a robust metric for model monitoring and data validation. We show how versatile and practical $\texttt{D-SOS}$ is on a wide range of real and simulated data.
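    The reduction at the heart of the approach can be pictured with off-the-shelf pieces (scikit-learn's IsolationForest is one possible outlier scorer here, and the single 95th-percentile cut is a simplified stand-in for the paper's weighted-threshold statistic):
    ```python
    # Hedged sketch of the D-SOS reduction: score both samples for outlyingness
    # relative to the training data, then compare how contaminated the test
    # sample is at a threshold. A single cut simplifies the paper's statistic.
    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    train = rng.normal(0.0, 1.0, size=(500, 4))
    test = rng.normal(0.2, 1.2, size=(500, 4))  # mildly shifted sample

    iso = IsolationForest(random_state=0).fit(train)
    s_train = -iso.score_samples(train)  # negate so higher = more outlying
    s_test = -iso.score_samples(test)

    thr = np.quantile(s_train, 0.95)
    print("train contamination:", np.mean(s_train > thr))  # ~0.05 by construction
    print("test contamination:", np.mean(s_test > thr))    # larger => adverse shift
    ```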
    An Embedded Model Estimator for Non-Stationary Random Functions using Multiple Secondary Variables. (arXiv:2011.04116v4 [stat.ME] UPDATED)
    (2 min) An algorithm for non-stationary spatial modelling using multiple secondary variables is developed. It combines Geostatistics with Quantile Random Forests to give a new interpolation and stochastic simulation algorithm. This paper introduces the method and shows that it has consistency results that are similar in nature to those applying to geostatistical modelling and to Quantile Random Forests. The method allows for embedding of simpler interpolation techniques, such as Kriging, to further condition the model. The algorithm works by estimating a conditional distribution for the target variable at each target location. The family of such distributions is called the envelope of the target variable. From this, it is possible to obtain spatial estimates, quantiles and uncertainty. An algorithm to produce conditional simulations from the envelope is also developed. As they sample from the envelope, realizations are therefore locally influenced by relative changes of importance of secondary variables, trends and variability.
    Interacting with Explanations through Critiquing. (arXiv:2005.11067v4 [cs.CL] UPDATED)
    (2 min) Using personalized explanations to support recommendations has been shown to increase trust and perceived quality. However, to actually obtain better recommendations, there needs to be a means for users to modify the recommendation criteria by interacting with the explanation. We present a novel technique using aspect markers that learns to generate personalized explanations of recommendations from review texts, and we show that human users significantly prefer these explanations over those produced by state-of-the-art techniques. Our work's most important innovation is that it allows users to react to a recommendation by critiquing the textual explanation: removing (symmetrically adding) certain aspects they dislike or that are no longer relevant (symmetrically that are of interest). The system updates its user model and the resulting recommendations according to the critique. This is based on a novel unsupervised critiquing method for single- and multi-step critiquing with textual explanations. Experiments on two real-world datasets show that our system is the first to achieve good performance in adapting to the preferences expressed in multi-step critiquing.
    A semi-agnostic ansatz with variable structure for quantum machine learning. (arXiv:2103.06712v2 [quant-ph] UPDATED)
    (2 min) Quantum machine learning (QML) offers a powerful, flexible paradigm for programming near-term quantum computers, with applications in chemistry, metrology, materials science, data science, and mathematics. Here, one trains an ansatz, in the form of a parameterized quantum circuit, to accomplish a task of interest. However, challenges have recently emerged suggesting that deep ansatzes are difficult to train, due to flat training landscapes caused by randomness or by hardware noise. This motivates our work, where we present a variable structure approach to build ansatzes for QML. Our approach, called VAns (Variable Ansatz), applies a set of rules to both grow and (crucially) remove quantum gates in an informed manner during the optimization. Consequently, VAns is ideally suited to mitigate trainability and noise-related issues by keeping the ansatz shallow. We employ VAns in the variational quantum eigensolver for condensed matter and quantum chemistry applications and also in the quantum autoencoder for data compression, showing successful results in all cases.
    Eigenvalue Distribution of Large Random Matrices Arising in Deep Neural Networks: Orthogonal Case. (arXiv:2201.04543v1 [stat.ML])
    (2 min) The paper deals with the distribution of singular values of the input-output Jacobian of deep untrained neural networks in the limit of their infinite width. The Jacobian is the product of random matrices where the independent rectangular weight matrices alternate with diagonal matrices whose entries depend on the corresponding column of the nearest neighbor weight matrix. The problem was considered in \cite{Pe-Co:18} for the Gaussian weights and biases and also for the weights that are Haar distributed orthogonal matrices and Gaussian biases. Based on a free probability argument, it was claimed that in these cases the singular value distribution of the Jacobian in the limit of infinite width (matrix size) coincides with that of the analog of the Jacobian with special random but weight independent diagonal matrices, the case well known in random matrix theory. The claim was rigorously proved in \cite{Pa-Sl:21} for a quite general class of weights and biases with i.i.d. (including Gaussian) entries by using a version of the techniques of random matrix theory. In this paper we use another version of the techniques to justify the claim for random Haar distributed weight matrices and Gaussian biases.
    Analysis of autocorrelation times in Neural Markov Chain Monte Carlo simulations. (arXiv:2111.10189v2 [cond-mat.stat-mech] UPDATED)
    (2 min) We provide an in-depth study of autocorrelations in Neural Markov Chain Monte Carlo simulations, a version of the traditional Metropolis algorithm which employs neural networks to provide independent proposals. We illustrate our ideas using the two-dimensional Ising model. We propose several estimates of autocorrelation times, some inspired by analytical results derived for the Metropolized Independent Sampler, which we compare and study as a function of inverse temperature $\beta$. Based on that we propose an alternative loss function and study its impact on the autocorrelation times. Furthermore, we investigate the impact of imposing system symmetries ($Z_2$ and/or translational) in the neural network training process on the autocorrelation times. Finally, we propose a scheme which incorporates partial heat-bath updates. The impact of the above enhancements is discussed for a $16 \times 16$ spin system. The summary of our findings may serve as a guide to the implementation of Neural Markov Chain Monte Carlo simulations of more complicated models.
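    For reference, a generic estimator of the integrated autocorrelation time that such comparisons build on is $\tau_{\mathrm{int}} = \frac{1}{2} + \sum_{t \ge 1} \rho(t)$, with $\rho$ the normalized autocorrelation function (a sketch with a naive fixed truncation window, not the paper's specific estimates):
    ```python
    # Hedged sketch: integrated autocorrelation time of a scalar MCMC series,
    # tau_int = 1/2 + sum_t rho(t) (one common convention), truncated at a
    # simple fixed window instead of a principled windowing rule.
    import numpy as np

    def autocorr_time(x, window=100):
        x = np.asarray(x, dtype=float)
        x = x - x.mean()
        acf = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0..n-1
        rho = acf / acf[0]  # normalized autocorrelation function
        return 0.5 + rho[1:window].sum()
    ```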
    Deep Hedging: Learning to Remove the Drift under Trading Frictions with Minimal Equivalent Near-Martingale Measures. (arXiv:2111.07844v3 [q-fin.CP] UPDATED)
    (2 min) We present a machine learning approach for finding minimal equivalent martingale measures for market simulators of tradable instruments, e.g. for a spot price and options written on the same underlying. We extend our results to markets with frictions, in which case we find "near-martingale measures" under which the prices of hedging instruments are martingales within their bid/ask spread. By removing the drift, we are then able to learn using Deep Hedging a "clean" hedge for an exotic payoff which is not polluted by the trading strategy trying to make money from statistical arbitrage opportunities. We correspondingly highlight the robustness of this hedge vs estimation error of the original market simulator. We discuss applications to two market simulators.
    Representation of Context-Specific Causal Models with Observational and Interventional Data. (arXiv:2101.09271v3 [math.ST] UPDATED)
    (2 min) We consider the problem of representing causal models that encode context-specific information for discrete data using a proper subclass of staged tree models which we call CStrees. We show that the context-specific information encoded by a CStree can be equivalently expressed via a collection of DAGs. As not all staged tree models admit this property, CStrees are a subclass that provides a transparent, intuitive and compact representation of context-specific causal information. We prove that CStrees admit a global Markov property which yields a graphical criterion for model equivalence generalizing that of Verma and Pearl for DAG models. These results extend to the general interventional model setting, making CStrees the first family of context-specific models admitting a characterization of interventional model equivalence. We also provide a closed-form formula for the maximum likelihood estimator of a CStree and use it to show that the Bayesian information criterion is a locally consistent score function for this model class. The performance of CStrees is analyzed on both simulated and real data, where we see that modeling with CStrees instead of general staged trees does not result in a significant loss of predictive accuracy, while affording DAG representations of context-specific causal information.
    On generalization bounds for deep networks based on loss surface implicit regularization. (arXiv:2201.04545v1 [stat.ML])
    (2 min) Classical statistical learning theory says that fitting too many parameters leads to overfitting and poor performance. That modern deep neural networks generalize well despite a large number of parameters contradicts this finding and constitutes a major unsolved problem towards explaining the success of deep learning. The implicit regularization induced by stochastic gradient descent (SGD) has been regarded as important, but its specific principle is still unknown. In this work, we study how the local geometry of the energy landscape around local minima affects the statistical properties of SGD with Gaussian gradient noise. We argue that under reasonable assumptions, the local geometry forces SGD to stay close to a low dimensional subspace and that this induces implicit regularization and results in tighter bounds on the generalization error for deep neural networks. To derive generalization error bounds for neural networks, we first introduce a notion of stagnation sets around the local minima and impose a local essential convexity property of the population risk. Under these conditions, lower bounds for SGD to remain in these stagnation sets are derived. If stagnation occurs, we derive a bound on the generalization error of deep neural networks involving the spectral norms of the weight matrices but not the number of network parameters. Technically, our proofs are based on controlling the change of parameter values in the SGD iterates and local uniform convergence of the empirical loss functions based on the entropy of suitable neighborhoods around local minima. Our work attempts to better connect non-convex optimization and generalization analysis with uniform convergence.
    Fighting Money-Laundering with Statistics and Machine Learning: An Introduction and Review. (arXiv:2201.04207v1 [stat.ML])
    (2 min) Money laundering is a profound, global problem. Nonetheless, there is little statistical and machine learning research on the topic. In this paper, we focus on anti-money laundering in banks. To help organize existing research in the field, we propose a unifying terminology and provide a review of the literature. This is structured around two central tasks: (i) client risk profiling and (ii) suspicious behavior flagging. We find that client risk profiling is characterized by diagnostics, i.e., efforts to find and explain risk factors. Suspicious behavior flagging, on the other hand, is characterized by non-disclosed features and hand-crafted risk indices. Finally, we discuss directions for future research. One major challenge is the lack of public data sets. This may, potentially, be addressed by synthetic data generation. Other possible research directions include semi-supervised and deep learning, interpretability and fairness of the results.
    Improving Disentangled Text Representation Learning with Information-Theoretic Guidance. (arXiv:2006.00693v3 [cs.LG] UPDATED)
    (2 min) Learning disentangled representations of natural language is essential for many NLP tasks, e.g., conditional text generation, style transfer, personalized dialogue systems, etc. Similar problems have been studied extensively for other forms of data, such as images and videos. However, the discrete nature of natural language makes the disentangling of textual representations more challenging (e.g., the manipulation over the data space cannot be easily achieved). Inspired by information theory, we propose a novel method that effectively manifests disentangled representations of text, without any supervision on semantics. A new mutual information upper bound is derived and leveraged to measure dependence between style and content. By minimizing this upper bound, the proposed method induces style and content embeddings into two independent low-dimensional spaces. Experiments on both conditional text generation and text-style transfer demonstrate the high quality of our disentangled representation in terms of content and style preservation.
    Leveraging Unlabeled Data to Predict Out-of-Distribution Performance. (arXiv:2201.04234v1 [cs.LG])
    (2 min) Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions that may cause performance drops. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold. ATC outperforms previous methods across several model architectures, types of distribution shifts (e.g., due to synthetic corruptions, dataset reproduction, or novel subpopulations), and datasets (Wilds, ImageNet, Breeds, CIFAR, and MNIST). In our experiments, ATC estimates target performance $2$-$4\times$ more accurately than prior methods. We also explore the theoretical foundations of the problem, proving that, in general, identifying the accuracy is just as hard as identifying the optimal predictor and thus, the efficacy of any method rests upon (perhaps unstated) assumptions on the nature of the shift. Finally, analyzing our method on some toy distributions, we provide insights concerning when it works.
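    The procedure as described is simple enough to sketch directly: calibrate a confidence threshold on labeled source data so that the fraction of examples above it matches source accuracy, then report the fraction of unlabeled target examples above that threshold (a sketch of the described recipe, with max-softmax confidence as one common choice of score):
    ```python
    # Sketch of Average Thresholded Confidence (ATC) as described in the
    # abstract: choose threshold t on labeled source data so that the fraction
    # of source points with confidence > t equals source accuracy, then predict
    # target accuracy as the fraction of target points with confidence > t.
    import numpy as np

    def atc_threshold(source_conf, source_correct):
        """source_conf: max-softmax confidences; source_correct: 0/1 correctness."""
        acc = source_correct.mean()
        # Pick t so that P(conf > t) on the source sample equals source accuracy.
        return np.quantile(source_conf, 1.0 - acc)

    def atc_estimate(target_conf, threshold):
        """Predicted target accuracy: fraction of target points above threshold."""
        return (target_conf > threshold).mean()
    ```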
    Optimal Fixed-Budget Best Arm Identification using the Augmented Inverse Probability Estimator in Two-Armed Gaussian Bandits with Unknown Variances. (arXiv:2201.04469v1 [stat.ML])
    (2 min) We consider the fixed-budget best arm identification problem in two-armed Gaussian bandits with unknown variances. The tightest lower bound on the complexity and an algorithm whose performance guarantee matches the lower bound have long been open problems when the variances are unknown and when the algorithm is agnostic to the optimal proportion of the arm draws. In this paper, we propose a strategy comprising a sampling rule with randomized sampling (RS) following the estimated target allocation probabilities of arm draws and a recommendation rule using the augmented inverse probability weighting (AIPW) estimator, which is often used in the causal inference literature. We refer to our strategy as the RS-AIPW strategy. In the theoretical analysis, we first derive a large deviation principle for martingales, which can be used when the second moment converges in mean, and apply it to our proposed strategy. Then, we show that the proposed strategy is asymptotically optimal in the sense that the probability of misidentification achieves the lower bound by Kaufmann et al. (2016) when the sample size becomes infinitely large and the gap between the two arms goes to zero.
  • Google AI Blog

    Learning to Route by Task for Efficient Inference
    (7 min) Posted by Sneha Kudugunta, Research Software Engineer and Orhan Firat, Research Scientist, Google Research Scaling large language models has resulted in significant quality improvements in natural language understanding (T5), generation (GPT-3), and multilingual neural machine translation (M4). One common approach to building a larger model is to increase the depth (number of layers) and width (layer dimensionality), simply enlarging existing dimensions of the network. Such dense models take an input sequence (divided into smaller components, called tokens) and pass every token through the full network, activating every layer and parameter. While these large, dense models have achieved state-of-the-art results on multiple natural language processing (NLP) tasks, their training cost increase…
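    In caricature, task-level routing means each input activates only the expert subnetwork its task selects, instead of every parameter of a dense model (a toy sketch with placeholder experts, not the blog's architecture):
    ```python
    # Toy sketch of task-level routing: a task id selects one expert subnetwork,
    # so an input only pays the compute of the expert it is routed to.
    # The experts below are placeholder functions, not the actual model.
    experts = {
        "translation": lambda x: f"translate({x})",
        "summarization": lambda x: f"summarize({x})",
    }

    def route(task_id, x):
        return experts[task_id](x)  # only the selected expert runs

    print(route("translation", "token batch"))
    ```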
  • The Official NVIDIA Blog

    From Imagination to Animation, How an Omniverse Creator Makes Films Virtually
    (3 min) Growing up in the Philippines, award-winning filmmaker Jae Solina says he turned to movies for a reminder that the world was much larger than himself and his homeland.
  • MIT News - Artificial intelligence

    “Hey, Alexa! Are you trustworthy?”
    (7 min) The more social behaviors a voice-user interface exhibits, the more likely people are to trust it, engage with it, and consider it to be competent.
  • AWS Machine Learning Blog

    Label text for aspect-based sentiment analysis using SageMaker Ground Truth
    (13 min) The Amazon Machine Learning Solutions Lab (MLSL) recently created a tool for annotating text with named-entity recognition (NER) and relationship labels using Amazon SageMaker Ground Truth. Annotators use this tool to label text with named entities and link their relationships, thereby building a dataset for training state-of-the-art natural language processing (NLP) machine learning (ML) models. Most […]
  • Artificial Intelligence

    Looking for a text to speech AI
    (0 min) Hi, so I was looking for a text-to-speech AI, much like natural readers or murf. Do you guys have any idea of one that is as good as them but is free? submitted by /u/ArturEPinheiro777 [link] [comments]
    Scientific Literature Review Generation v1.0
    (1 min) Hello, after my PhD I developed a first version of an algorithm to automatically generate a literature review: https://www.naimai.fr, and many remarks were given. I just deployed a new version with many more papers and I'll be thankful if you have any remarks about it :) More about the new version here: https://yaassinekaddi.medium.com/scientific-literature-generation-ii-73628aebd4fb Hopefully that could be useful for the PhDs (and the non-PhDs)! Cheers, submitted by /u/Cyalas [link] [comments]
    Searching participants for art project
    (1 min) Hi, I’m part of an art group from Switzerland currently studying at HSLU Design & Arts (https://www.hslu.ch/de-ch/design-kunst/studium/bachelor/camera-arts/). The group consists of: Karim Beji (https://www.instagram.com/karimbeji_/ https://karimbeji.ch/) Emanuel Bohnenblust (https://www.instagram.com/e.bohnenblust/) Lea Karabash (https://www.instagram.com/leakarabashian/) Yen Shih-hsuan (https://www.instagram.com/shixuan.yan/ http://syen.hfk-bremen.de/) At the moment, we are working on a project on the topic of whether AI can augment the happiness of humans. To answer this question, we are mainly working with chatbots. The end result is going to be an exhibition at the end of March. For that exhibition, we want to conduct a trial in which people from all over the world chat with a chatbot to find out if and how it augments the mood of the participants. We would give you access to a GPT-3 (OpenAI) chatbot and ask you to a) record yourself through a webcam (laptop) while you are chatting and b) simultaneously screen record the chat window. In the exhibition we would have a) a book with all the chats and b) small videos with your faces (webcam) to assess your mood. We would have a Zoom meeting beforehand to discuss everything. Looking forward to your message! submitted by /u/Nebeldiener [link] [comments]
    Are there any A.I. projects being developed that turn 2D (abstract) art into complex composition 3D models according to specs and dataset training of the A.I.? Is there any plugin like this for a 3D app?
    (1 min) A plugin that recognizes dots in a 2D picture, pulls on them, and makes line predictions, among other things. Not mech engineering apps or apps that turn a set of 2D photos into 3D. submitted by /u/Niu_Davinci [link] [comments]
    Can you complete this short survey on the applications of conversational AI? I'd be much obliged 😅
    (0 min) submitted by /u/KazRainer [link] [comments]
    Growth Opportunities for Global Artificial Intelligence in Automotive
    (1 min) This research service examines the role artificial intelligence (AI) will play in the transformation of the automotive space. AI is a key disruptive technology, wherein automakers are evolving into technology firms and expanding their service offerings beyond manufacturing vehicles. Technology implementation has increased, and the post-pandemic situation appears to be positive for all stakeholders; however, automakers have yet to fully harness AI’s potential in their service offerings. Although AI is in the nascent stage of development, OEMs are adopting it across the automotive value chain to improve manufacturing and to enhance customer experience, marketing, sales, and after-sales services. This report examines use cases and business opportunity areas for various players in the automotive ecosystem, including OEMs, Tier I suppliers and technology service providers, and new entrants or start-ups. As the industry continues to evolve, AI capabilities will become the core of automotive solutions. The study identifies key AI trends impacting the industry, including the convergence of connectivity, autonomous, sharing/subscription, and electrification (CASE). https://uk.sports.yahoo.com/news/growth-opportunities-global-artificial-intelligence-101800220.html?guccounter=1 submitted by /u/ai_datascience [link] [comments]
    Researchers From NVIDIA & Vanderbilt University Propose ‘Swin UNETR’: A Novel Architecture for Semantic Segmentation of Brain Tumors Using Multi-Modal MRI Images
    (1 min) The human brain is affected by about 120 different forms of brain tumors. AI-based intervention for tumor identification and surgical pre-assessment is on the verge of becoming a necessity rather than a luxury as we approach the era of artificial intelligence (AI) in healthcare. Brain tumors can be characterized in depth using techniques like volumetric analysis to track their growth and aid in pre-surgical planning. The characterization of defined tumors can be directly used for the prognosis of life expectancy, in addition to surgical uses. The segmentation of brain tumors is at the forefront of all such applications. Primary and secondary tumors are the two forms of brain tumors. Primary brain cancers arise from brain cells, whereas secondary tumors spread from other organs to the brain. Continue Reading Paper: https://arxiv.org/pdf/2201.01266v1.pdf Github: https://github.com/Project-MONAI/research-contributions/tree/master/SwinUNETR submitted by /u/techsucker [link] [comments]
    Image to Text A.I??
    (1 min) Like the title says, I am looking to see if there is an A.I. that will generate a string of text based on what it thinks it sees in a given image (e.g., [image of a human DNA strand], and the A.I. would type something like "image of the human genome, encoded as DNA, then list off the parts of the DNA strand [nuclei etc...]"). I know text-to-image already exists but I cannot seem to find anything like this, even a more simple version that can process planets or animals or anything, thanks! submitted by /u/FreeOfGovernment [link] [comments]
  • Machine Learning

    [P] Sentiment Analysis on 7,000+ Words?
    (1 min) Alright, so quick disclaimer: I'm a high-schooler, and probably biting off more than I can chew, but I'm trying this either way, so please excuse any stupid questions. I'm looking to do sentiment analysis on a number of 7,000+ word long reports that have been released over a period of years to look at the change in sentiment. From what I've done and from what I've seen others do, most sentiment analysis is on text that is much, much shorter (~sentence length). Is there any model that can take text of that length? If not, my other thought was to split each report up into individual sentences, feed those through the model, and average the sentiment for each report. Could that work? Or would the results be utter crap? Are there better ways of tackling this issue? Any advice or information would be appreciated, thank you all so much! submitted by /u/Blue_Jay89 [link] [comments]
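    The sentence-splitting-and-averaging idea from the post is workable and easy to prototype; whether a plain average washes out the signal for these reports is the real question, so per-section averages are worth comparing too. A sketch under the assumption that an off-the-shelf English sentiment model fits the reports' domain:
    ```python
    # Hedged sketch: split a long report into sentences, score each with an
    # off-the-shelf sentiment model, and average the signed scores. Chunking is
    # needed because typical transformer sentiment models cap input at ~512 tokens.
    from transformers import pipeline

    clf = pipeline("sentiment-analysis")  # default English sentiment model

    def report_sentiment(report: str) -> float:
        # Naive sentence splitting; a proper splitter (e.g., nltk) would be better.
        sentences = [s.strip() for s in report.split(".") if s.strip()]
        scores = []
        for out in clf(sentences, truncation=True):
            signed = out["score"] if out["label"] == "POSITIVE" else -out["score"]
            scores.append(signed)
        return sum(scores) / len(scores)  # mean signed sentiment in [-1, 1]
    ```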
    [D] How is the MDP Homomorphism approach in the below paper different from RL with an Embedding Space like for example CURL?
    (1 min) I'm reading: Plannable Approximations to MDP Homomorphisms: Equivariance under Actions (arxiv: https://arxiv.org/pdf/2002.11963.pdf) and I'm not understanding how this approach differs from RL training in embedding space, for example, CURL. I do understand that they seem to be training both in the base space and the embedding space, but I'm not sure what this buys them. Thanks for your help! submitted by /u/www3cam [link] [comments]
    [Discussion] What type of a ML community member are you?
    (1 min) Hi all! I was thinking about how easy it is to follow Machine Learning as a non-professional in such depth as to have a reasonable/functional understanding of it, and got curious about what types of users are in this community. If you are not a professional, don't plan to be, or you selected Other, feel free to elaborate about your relationship with ML in the comments. View Poll submitted by /u/TsirixtoVatraxi [link] [comments]
    [P] Back to back YOLO?
    (1 min) I'm working on a real-time image recognition project (shocker) and want to identify main classes (Car, sign, tree) as well as subclasses within them ((Audi, BMW, Ford), (Stop, Yield, Speed Limit), (Pine, Douglas, Maple)). It would be cool to do sub-sub classes, like ID Sign->Speed Limit->45MPH, but that might be extreme. I was thinking running YOLO to do the main class and then piping the results into separate YOLOs might work, but can't find much on how that might be done or if it's even a good idea. Any guidance would be a great help!! submitted by /u/CosmosFood [link] [comments]
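    The piping itself is straightforward in principle: run the main detector, crop each box, and hand the crop to a class-specific second stage (the model and function names below are placeholders for whatever detector and classifiers you train, not a specific library API). A single detector trained directly on the fine-grained classes is a simpler baseline worth trying first.
    ```python
    # Hedged sketch of back-to-back detection: a main-class detector followed by
    # a per-class subclass classifier run on each cropped box. `main_detector`
    # and `sub_classifiers` are placeholders for your trained models.
    def detect_hierarchical(image, main_detector, sub_classifiers):
        results = []
        for box, main_class in main_detector(image):       # e.g., ("sign", box)
            x1, y1, x2, y2 = box
            crop = image[y1:y2, x1:x2]                     # HxWxC numpy crop
            sub_class = sub_classifiers[main_class](crop)  # e.g., "Stop"
            results.append((box, main_class, sub_class))
        return results
    ```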
    [D] What framework do you do your data processing (small and large datasets) in?
    (1 min) Hey all! I'm curious to know what everyone does their data processing in? For industry purposes on smaller datasets I've used Pandas and sklearn, while for larger ones I've used Dask. I know that Tensorflow and Pytorch have their own dataloader frameworks, but do they scale to large datasets? Say 100GB++? Do people do data transformations in Jax? If you had infinite time, how would you (re)do your company tech stack? My data is 85% timeseries, 10% images, and 5% tabular data. submitted by /u/iamquah [link] [comments]
    [P] Scientific Literature Review Generation v1.0
    (1 min) Hello, after my PhD I developed a first version of an algorithm to automatically generate a literature review: https://www.naimai.fr, and many remarks were given. I just deployed a new version with many more papers and I'll be thankful if you have any remarks about it :) More about the new version here: https://yaassinekaddi.medium.com/scientific-literature-generation-ii-73628aebd4fb Hopefully that could be useful for the PhDs (and the non-PhDs)! Cheers, submitted by /u/Cyalas [link] [comments]
    [D] Reinforcement Learning As A Fine-Tuning Paradigm
    (0 min) https://ankeshanand.com/blog/2022/01/08/rl-fine-tuning.html Reinforcement Learning (RL) should be better seen as a “fine-tuning” paradigm that can add capabilities to general-purpose pretrained models, rather than a paradigm that can bootstrap intelligence from scratch. submitted by /u/EducationalCicada [link] [comments]
    [D] Leading Research Groups/ Researchers in AI Robotics & Theoretical RL
    (1 min) Hi, I would like to know who the current leading researchers/research groups are in AI Robotics and in Theoretical Reinforcement Learning. Groups I know: Theoretical Reinforcement Learning - Rich Sutton's Lab at UofAlberta. AI Robotics Groups: Stanford AI Lab, Sergey Levine's Lab. I want to make a comprehensive list of this and would really appreciate any kind of help here. Thanks. submitted by /u/EqualMarsupial2759 [link] [comments]
    [R] GradMax: Growing Neural Networks using Gradient Information
    (0 min) submitted by /u/hardmaru [link] [comments]
    [N] Machine Learning for Audio with Hugging Face
    (3 min) Hi! Over the last year at Hugging Face we’ve been working quite a bit to enable people to take advantage of state-of-the-art ML methods for audio, and we’re trying to get the word out about our efforts. Meh, I’m not a 🤗 Transformers user, why should I care? We’re not just building stuff for 🤗 Transformers users! We have built integrations with open-source libraries such as speechbrain, pyannote, ESPnet, Asteroid and Facebook Fairseq as well as dataset tools. Some example models you can try directly in the browser: FastSpeech 2 for Text to Speech pyannote Audio Segmentation SpeechBrain Language Identification model Facebook XLS-R demo to translate your audio from any language to any other target language Where’s the data? We’re working on improving audio support in the datase…
    [D] What is the most accurate open source ASR library with multiple language support?
    (1 min) I am working on one project involving free speech and I was wondering what your opinions are on the best open source libraries that can do the transcription job well. (I use Python, and the languages are German and Dutch, apart from English.) I heard about NeMo from Nvidia; however, it currently does not support Dutch. I know that solutions are sometimes not stable and of course a lot depends on the quality of the recording, but assume good recording quality, etc. submitted by /u/MrJacobJohnson [link] [comments]
    [R] Bayesian additive regression trees and the General BART model
    (1 min) tl;dr: There are a variety of BART models for different kinds of data, this article reviews a bunch of them. abstract: "Bayesian additive regression trees (BART) is a flexible prediction model/machine learning approach that has gained widespread popularity in recent years. As BART becomes more mainstream, there is an increased need for a paper that walks readers through the details of BART, from what it is to why it works. This tutorial is aimed at providing such a resource. In addition to explaining the different components of BART using simple examples, we also discuss a framework, the General BART model, that unifies some of the recent BART extensions, including semiparametric models, correlated outcomes, statistical matching problems in surveys, and models with weaker distributional assumptions. By showing how these models fit into a single framework, we hope to demonstrate a simple way of applying BART to research problems that go beyond the original independent continuous or binary outcomes framework. " paper: https://onlinelibrary.wiley.com/doi/full/10.1002/sim.8347 arxiv'd: https://arxiv.org/abs/1901.07504 submitted by /u/bikeskata [link] [comments]
    [P] - Tutorial: Time-Series Transformer + Time2Vec
    (4 min) [Tutorial] Transformers for Time-Series (Time2Vec) How to use transformers for Time-series data? 🤔 TLDR: The notebook Rolling Window Setup Our data comes in a table. We want to model a series. (which is not a table) We got a problem. In order to solve our problem we will transform the table to the following set-up: Each of our input vectors will be a slice of the dataset representing a rolling-window of N times-steps, For each of these vectors we will also slice a target vector corresponding to the indices of the rolling-window. In short, we create the following input and output: X shape: (samples, time-steps, features) y shape: (samples, 1) Figure: https://i.ibb.co/ZfsK9ZY/1-q-6btr-S9-S-mc-JDdc-AW-It-A.png In order for us to get a better understanding, let…
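    The rolling-window reshape the tutorial describes can be written in a few lines of NumPy (a sketch matching the stated shapes, assuming a univariate series and next-step targets):
    ```python
    # Sketch of the rolling-window setup: turn a series into
    # X of shape (samples, time_steps, features) and y of shape (samples, 1),
    # here for a univariate series with next-step prediction targets.
    import numpy as np

    def make_windows(series, time_steps):
        series = np.asarray(series, dtype=float)
        X, y = [], []
        for i in range(len(series) - time_steps):
            X.append(series[i:i + time_steps])   # window of the last N steps
            y.append(series[i + time_steps])     # the step right after it
        X = np.array(X)[..., None]   # (samples, time_steps, 1 feature)
        y = np.array(y)[:, None]     # (samples, 1)
        return X, y

    X, y = make_windows(np.sin(np.linspace(0, 20, 500)), time_steps=32)
    print(X.shape, y.shape)  # (468, 32, 1) (468, 1)
    ```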
    [N] AugeLab Studio - Machine Vision made easy
    (1 min) Hello all, I am one of the founders and senior programmers of AugeLab Studio, a visual programming environment that focuses on rapid development and deployment of machine vision projects. It supports YOLO, TensorFlow, OpenCV, scikit and many other libraries. Currently available on Windows machines. Take a sneak peek: https://www.youtube.com/watch?v=GSy07Kdjf70 The concept is based on nodes and connecting them to create a flow that represents an algorithm. You can also write your own nodes and integrate them on the go. A simple measuring task can be done visually like below: measuring the distance between contours. You can download it for free from www.augelab.com. The professional version of the software, which is available for students and researchers, includes object detection algorithms (pre-trained), training object detection and labeling tools, and creating custom CNNs. Feel free to ask anything. submitted by /u/Suspcious-chair [link] [comments]
    [D] Alternatives to SKLearn interface?
    (2 min) The fit/predict interface introduced by SKLearn has been really successful and has shaped the ML community in the past decade, but lately I have been thinking about possible alternatives or extensions. The only consistent alternative I found is the Haiku init/apply, and in general the approach of some "pure" JAX projects. Do you know any alternative approach? Do you have any ideas on what a good 'ML-grammar' should be? submitted by /u/abio93 [link] [comments]
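    The contrast is easiest to see side by side: fit/predict mutates an estimator object, while init/apply keeps parameters as explicit values threaded through pure functions (a plain-Python caricature of the two patterns, not Haiku's actual API). The absence of hidden state in the second style is what makes the functions easy to transform (jit, vmap, grad) in JAX-like frameworks.
    ```python
    # Caricature of the two interfaces for a 1-D linear model.
    import numpy as np

    # sklearn-style: a stateful object; fit mutates it in place.
    class LinReg:
        def fit(self, X, y):
            self.w = np.linalg.lstsq(X, y, rcond=None)[0]
            return self
        def predict(self, X):
            return X @ self.w

    # init/apply-style: parameters are explicit data, functions are pure.
    def init(n_features):
        return {"w": np.zeros(n_features)}

    def apply(params, X):
        return X @ params["w"]
    ```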
    [R] Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet?
    (0 min) submitted by /u/downtownslim [link] [comments]
    Deep Learning Interviews: Hundreds of fully solved job interview questions from a wide range of key topics in AI
    (1 min) submitted by /u/ksinkar [link] [comments]
    [R] HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning
    (0 min) submitted by /u/hardmaru [link] [comments]
  • Reinforcement Learning

    Reinforcement Learning As A Fine-Tuning Paradigm
    (0 min) submitted by /u/EducationalCicada [link] [comments]
    Leading Research Groups/ Researchers in AI Robotics & Theoretical RL
    (1 min) Hi, I would like to know who the current leading researchers/research groups are in AI Robotics and in Theoretical Reinforcement Learning. Groups I know: Theoretical Reinforcement Learning - Rich Sutton's Lab at UofAlberta. AI Robotics Groups: Stanford AI Lab, Sergey Levine's Lab. I want to make a comprehensive list of this and would really appreciate any kind of help here. Thanks. submitted by /u/EqualMarsupial2759 [link] [comments]
    Sub-optimal actions in a continuous setting
    (1 min) I am currently working on a simple battery management agent (PPO, SB3), which should optimize its charge/discharge under a variety of loads. The agent converges and its actions [-1: discharge entire battery, 1: charge entire battery] are pretty good, but the discharge never matches the load exactly (optimal); there is always a deviation of +-10% or so. Is this something inherent to the algorithm/RL, or am I missing something elementary? Which hyperparameters influence the action distribution and are worth experimenting with? I tried an env where there is a single obs [0,1] and where the agent needs to output that exact number as its action [0,1] for max rewards, and it struggles to achieve this (accurate to 3-4 decimal points, more accurate with lower learning rate). Input would be greatly appreciated! submitted by /u/throwaway582643766 [link] [comments]
    Comprehensive list of differences between value-based and policy-based DRL
    (1 min) I'm trying to create a comprehensive list of differences between value-based and policy-based DRL. (1) The main advantage of learning parameterized policies is that policies can now be any learnable function, whereas value-based methods, in the case of continuous action spaces, are severely limited. (2) Policy-based methods can more easily learn stochastic policies, which in turn has multiple additional advantages: (a) better performance under partially observable environments, because we can learn arbitrary probabilities of actions and thus the agent is less dependent on the Markov assumption; in value-based methods we have to force exploration with some probability to ensure optimality, whereas in policy-based methods with stochastic policies, exploration is embedded in the learned function, and converging to a deterministic policy for a given state while training is possible; (b) it could be more straightforward for function approximation to represent a policy than a value function, since sometimes a value function is more information than is truly needed. (3) Because policies are parameterized with continuous values, the action probabilities change smoothly as a function of the learned parameters; therefore, policy-based methods often have better convergence properties. (4) One of the main advantages of optimizing policies directly is that it's the right objective: we learn a policy that optimizes the objective directly, without learning a value function, and without taking into account the dynamics of the environment. (5) The main advantage of value-based deep reinforcement learning is that such methods are (usually) off-policy and, when they work, they tend to be much more sample efficient than policy optimization methods, because they can exploit data from experts or other sources. Do you guys agree with this? Can you think of other points I've surely missed? Thanks! submitted by /u/No_Possibility_7588 [link] [comments]
    Multi-Agent RL in Computer Vision.
    (1 min) Can you give me some examples where a multi-agent RL framework is used to solve computer vision tasks like object detection, event detection, action recognition, tracking, and so on? I am actually interested to know how multiple agents communicate/cooperate to solve a task, and how the communication/cooperation can be modeled. submitted by /u/Infamous-Editor5131 [link] [comments]
    In my Deep RL algorithm, my state representation is a vector of 'n' dimensions, which I need to get from k vectors of the same 'n' dimensions each. What neural network technique should I use to obtain this state representation?
    (1 min) submitted by /u/aabra__ka__daabra [link] [comments]
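    A permutation-invariant pooling over the k vectors is the usual answer: mean or max pooling if all vectors matter equally, or a small attention pooling if the network should learn which ones matter (a PyTorch sketch of the latter, as one option):
    ```python
    # Hedged sketch: pool k n-dimensional vectors into one n-dimensional state
    # with learned attention weights (mean or max pooling are simpler options).
    import torch
    import torch.nn as nn

    class AttentionPool(nn.Module):
        def __init__(self, n_dim):
            super().__init__()
            self.score = nn.Linear(n_dim, 1)  # one scalar score per vector

        def forward(self, vectors):          # vectors: (k, n_dim)
            weights = torch.softmax(self.score(vectors), dim=0)  # (k, 1)
            return (weights * vectors).sum(dim=0)                # (n_dim,)

    pool = AttentionPool(n_dim=16)
    state = pool(torch.randn(8, 16))  # k=8 vectors -> one 16-dim state
    print(state.shape)  # torch.Size([16])
    ```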
    What is the difference between a batch and a buffer?
    (1 min) What is the difference between how NFQ grows a batch to optimize several samples at the same time and how DQN uses a replay buffer that holds experience samples for several steps? submitted by /u/No_Possibility_7588 [link] [comments]
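    In code the distinction is essentially growth versus fixed-capacity sampling (a minimal sketch): NFQ-style methods accumulate all transitions and refit on the whole growing batch each iteration, while DQN keeps a bounded buffer and samples random mini-batches from it.
    ```python
    # Minimal sketch: DQN-style bounded replay buffer with random mini-batch
    # sampling. An NFQ-style "growing batch" is the same idea with unbounded
    # capacity and training on the *entire* stored set each iteration.
    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=100_000):
            self.buf = deque(maxlen=capacity)  # old transitions fall off the end

        def push(self, state, action, reward, next_state, done):
            self.buf.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            return random.sample(self.buf, batch_size)
    ```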
    How to integrate a trained supervised learning model into RL?
    (2 min) Hi, this might be a longshot but here we go: I have developed a very good HFT classification price predictor implemented in TensorFlow using some of the modern NLP approaches. HOWEVER, market making is a bit trickier than that, hence I want to use an RL framework to develop a strategy around it (particularly with a focus on inventory management). Does anybody know where to start? I've got a classification model whose output (one of three classes) would work as an input signal for an RL strategy, as well as the price change over time of the underlying. Metrics could be PnL etc. I've got billions of data points too. Would be grateful if someone could point me in a good direction on where to start (i.e., combining trained SL models with RL), or tell me if I'm barking up the wrong tree completely. I did supervised learning for a long time, but not RL. Ideally I'd be looking for a smaller model. submitted by /u/stocktiger6969 [link] [comments]
    RL Reward in MATLAB
    (1 min) Hi guys, I have a DC motor Simulink model in MATLAB that I want to control using RL (DQN). I implemented my agent, but it always gives me a constant reward: the training steps advance, yet the reward never changes. Even adding two observations still gave me a constant reward while the training steps completed. Does anybody know the reason for that? submitted by /u/Armin1371 [link] [comments]
    "Automated Reinforcement Learning (AutoRL): A Survey and Open Problems", Parker-Holder et al 2022
    (1 min) submitted by /u/gwern [link] [comments]

2022-01-13

  • cs.LG updates on arXiv.org

    FRIDA -- Generative Feature Replay for Incremental Domain Adaptation. (arXiv:2112.14316v2 [cs.CV] UPDATED)
    (2 min) We tackle the novel problem of incremental unsupervised domain adaptation (IDA) in this paper. We assume that a labeled source domain and different unlabeled target domains are incrementally observed, with the constraint that only data corresponding to the current domain is available at any given time. The goal is to preserve the accuracies for all the past domains while generalizing well for the current domain. The IDA setup suffers due to the abrupt differences among the domains and the unavailability of past data, including the source domain. Inspired by the notion of generative feature replay, we propose a novel framework called Feature Replay based Incremental Domain Adaptation (FRIDA) which leverages a new incremental generative adversarial network (GAN) called domain-generic auxiliary classification GAN (DGAC-GAN) for producing domain-specific feature representations seamlessly. For domain alignment, we propose a simple extension of the popular domain adversarial neural network (DANN) called DANN-IB which encourages discriminative domain-invariant and task-relevant feature learning. Experimental results on Office-Home, Office-CalTech, and DomainNet datasets confirm that FRIDA maintains a better stability-plasticity trade-off than the literature.
    Universal Efficient Variable-rate Neural Image Compression. (arXiv:2111.11305v3 [eess.IV] UPDATED)
    (2 min) Recently, learning-based image compression has reached performance comparable to traditional image codecs (such as JPEG, BPG, WebP). However, computational complexity and rate flexibility are still two major challenges for its practical deployment. To tackle these problems, this paper proposes two universal modules named Energy-based Channel Gating (ECG) and Bit-rate Modulator (BM), which can be directly embedded into existing end-to-end image compression models. ECG uses dynamic pruning to reduce FLOPs by more than 50% in convolution layers, and a BM pair can modulate the latent representation to control the bit-rate in a channel-wise manner. By implementing these two modules, existing learning-based image codecs gain the ability to output arbitrary bit-rates with a single model and reduced computation.
    Distilling the Knowledge of Romanian BERTs Using Multiple Teachers. (arXiv:2112.12650v2 [cs.CL] UPDATED)
    (2 min) Running large-scale pre-trained language models in computationally constrained environments remains a challenging problem yet to be addressed, while transfer learning from these models has become prevalent in Natural Language Processing tasks. Several solutions, including knowledge distillation, network quantization, or network pruning have been previously proposed; however, these approaches focus mostly on the English language, thus widening the gap when considering low-resource languages. In this work, we introduce three light and fast versions of distilled BERT models for the Romanian language: Distil-BERT-base-ro, Distil-RoBERT-base, and DistilMulti-BERT-base-ro. The first two models resulted from the individual distillation of knowledge from two base versions of Romanian BERTs available in the literature, while the last one was obtained by distilling their ensemble. To our knowledge, this is the first attempt to create publicly available Romanian distilled BERT models, which were thoroughly evaluated on five tasks: part-of-speech tagging, named entity recognition, sentiment analysis, semantic textual similarity, and dialect identification. Our experimental results show that the three distilled models retain most of their teachers' accuracy, while being twice as fast on a GPU and ~35% smaller. In addition, we further test the similarity between the predictions of our students versus their teachers by measuring their label and probability loyalty, together with regression loyalty - a new metric introduced in this work.
    Scalable Diverse Model Selection for Accessible Transfer Learning. (arXiv:2111.06977v2 [cs.LG] UPDATED)
    (2 min) With the preponderance of pretrained deep learning models available off-the-shelf from model banks today, finding the best weights to fine-tune for your use case can be a daunting task. Several methods have recently been proposed to find good models for transfer learning, but they either don't scale well to large model banks or don't perform well on the diversity of off-the-shelf models. Ideally the question we want to answer is, "given some data and a source model, can you quickly predict the model's accuracy after fine-tuning?" In this paper, we formalize this setting as "Scalable Diverse Model Selection" and propose several benchmarks for evaluating this task. We find that existing model selection and transferability estimation methods perform poorly here and analyze why this is the case. We then introduce simple techniques to improve the performance and speed of these algorithms. Finally, we iterate on existing methods to create PARC, which outperforms all other methods on diverse model selection. We have released the benchmarks and method code in the hope of inspiring future work in model selection for accessible transfer learning.
    Extending Environments To Measure Self-Reflection In Reinforcement Learning. (arXiv:2110.06890v2 [cs.AI] UPDATED)
    (2 min) We consider an extended notion of reinforcement learning in which the environment can simulate the agent and base its outputs on the agent's hypothetical behavior. Since good performance usually requires paying attention to whatever things the environment's outputs are based on, we argue that for an agent to achieve on-average good performance across many such extended environments, it is necessary for the agent to self-reflect. Thus, an agent's self-reflection ability can be numerically estimated by running the agent through a battery of extended environments. We are simultaneously releasing an open-source library of extended environments to serve as proof-of-concept of this technique. As the library is the first of its kind, we have avoided the difficult problem of optimizing it. Instead we have chosen environments with interesting properties. Some seem paradoxical, some lead to interesting thought experiments, some are even suggestive of how self-reflection might have evolved in nature. We give examples and introduce a simple transformation which experimentally seems to increase self-reflection.
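    A toy illustration of the idea (my own construction, not taken from the released library): the environment also queries the agent's policy on a hypothetical state, so the reward is no longer a function of the visited states and actions alone.

        import random

        def run_episode(agent_policy, steps=10):
            total, state = 0.0, 0
            for _ in range(steps):
                action = agent_policy(state)
                # the environment simulates the agent on a counterfactual state
                hypothetical_action = agent_policy(state + 1)
                # reward only behaviour that actually depends on the state; a reward
                # defined this way cannot be expressed by an ordinary environment
                total += 1.0 if action != hypothetical_action else 0.0
                state = random.randint(0, 5)
            return total

        print(run_episode(lambda s: 0))      # state-blind policy scores 0.0
        print(run_episode(lambda s: s % 2))  # state-aware policy scores 10.0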
    Improving language models by retrieving from trillions of tokens. (arXiv:2112.04426v2 [cs.CL] UPDATED)
    (2 min) We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a 2 trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25× fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen BERT retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly RETROfit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.
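    A heavily simplified sketch of the retrieval step (my own; a hashed bag-of-words embedder stands in for the frozen BERT retriever): embed the preceding chunk, look up its nearest neighbours in the chunk database, and hand them to the model as extra conditioning.

        import numpy as np

        def embed(text, dim=64):
            """Stand-in for the frozen chunk embedder."""
            v = np.zeros(dim)
            for tok in text.split():
                v[hash(tok) % dim] += 1.0
            return v / (np.linalg.norm(v) + 1e-8)

        database = ["the cat sat on the mat", "stock prices fell sharply",
                    "the dog chased the cat", "rain is expected tomorrow"]
        db_vecs = np.stack([embed(c) for c in database])

        def retrieve(preceding_chunk, k=2):
            sims = db_vecs @ embed(preceding_chunk)   # cosine similarity (unit vectors)
            return [database[i] for i in np.argsort(-sims)[:k]]

        # the retrieved neighbours condition next-token prediction via cross-attention
        print(retrieve("the cat slept"))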
    Directional Message Passing on Molecular Graphs via Synthetic Coordinates. (arXiv:2111.04718v2 [cs.LG] UPDATED)
    (2 min) Graph neural networks that leverage coordinates via directional message passing have recently set the state of the art on multiple molecular property prediction tasks. However, they rely on atom position information that is often unavailable, and obtaining it is usually prohibitively expensive or even impossible. In this paper we propose synthetic coordinates that enable the use of advanced GNNs without requiring the true molecular configuration. We propose two distances as synthetic coordinates: Distance bounds that specify the rough range of molecular configurations, and graph-based distances using a symmetric variant of personalized PageRank. To leverage both distance and angular information we propose a method of transforming normal graph neural networks into directional MPNNs. We show that with this transformation we can reduce the error of a normal graph neural network by 55% on the ZINC benchmark. We furthermore set the state of the art on ZINC and coordinate-free QM9 by incorporating synthetic coordinates in the SMP and DimeNet++ models. Our implementation is available online.
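    As a rough illustration of the graph-based distance (my own approximation; the paper's exact symmetrisation may differ), one can turn personalized PageRank scores into a symmetric pairwise proximity and use it as a synthetic coordinate:

        import networkx as nx

        G = nx.cycle_graph(6)   # stand-in for a molecular graph
        ppr = {u: nx.pagerank(G, alpha=0.85,
                              personalization={n: float(n == u) for n in G})
               for u in G}

        def ppr_distance(u, v):
            # symmetrised personalized-PageRank proximity turned into a distance
            return 1.0 - 0.5 * (ppr[u][v] + ppr[v][u])

        print(ppr_distance(0, 1), ppr_distance(0, 3))  # neighbours come out closer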
    NarrationBot and InfoBot: A Hybrid System for Automated Video Description. (arXiv:2111.03994v2 [cs.HC] UPDATED)
    (2 min) Video accessibility is crucial for blind and low vision users for equitable engagements in education, employment, and entertainment. Despite the availability of professional and amateur services and tools, most human-generated descriptions are expensive and time-consuming. Moreover, the rate of human-generated descriptions cannot match the speed of video production. To overcome the increasing gaps in video accessibility, we developed a hybrid system of two tools to 1) automatically generate descriptions for videos and 2) provide answers or additional descriptions in response to user queries on a video. Results from a mixed-methods study with 26 blind and low vision individuals show that our system significantly improved user comprehension and enjoyment of selected videos when both tools were used in tandem. In addition, participants reported no significant difference in their ability to understand videos when presented with autogenerated descriptions versus human-revised autogenerated descriptions. Our results demonstrate user enthusiasm about the developed system and its promise for providing customized access to videos. We discuss the limitations of the current work and provide recommendations for the future development of automated video description tools.
    Neural HMMs are all you need (for high-quality attention-free TTS). (arXiv:2108.13320v5 [eess.AS] UPDATED)
    (2 min) Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic and the use of non-monotonic attention both increases training time and introduces "babbling" failure modes that are unacceptable in production. This paper demonstrates that the old and new paradigms can be combined to obtain the advantages of both worlds. In particular, we replace the attention in Tacotron 2 with an autoregressive left-right no-skip hidden Markov model defined by a neural network. This leads to an HMM-based neural TTS model with monotonic alignment, trained to maximise the full sequence likelihood without approximations. We discuss how to combine innovations from both classical and contemporary TTS for best results. The final system is smaller and simpler than Tacotron 2, and learns to speak with fewer iterations and less data, whilst achieving the same naturalness prior to the post-net. Our approach also allows easy control over speaking rate. Audio examples and code are available at https://shivammehta007.github.io/Neural-HMM/
    Pediatric Automatic Sleep Staging: Deep Learning Ensemble Improves Accuracy and Reduces Predictive Uncertainty. (arXiv:2108.10211v2 [eess.SP] UPDATED)
    (2 min) Despite the tremendous progress recently made towards automatic sleep staging in adults, it is currently unknown if the most advanced algorithms generalize to the pediatric population, which displays distinctive characteristics in overnight polysomnography (PSG). To answer the question, in this work, we conduct a large-scale comparative study on the state-of-the-art deep learning methods for pediatric automatic sleep staging. A selection of six different deep neural networks with diverging features are adopted to evaluate a sample of more than 1,200 children across a wide spectrum of obstructive sleep apnea (OSA) severity. Our experimental results show that the individual performance of automated pediatric sleep stagers when evaluated on new subjects is equivalent to the expert-level one reported on adults. Combining the six stagers into ensemble models further boosts the staging accuracy, reaching an overall accuracy of 87.7%, a Cohen's kappa of 0.837, and a macro F1-score of 84.2% in case of single-channel EEG when evaluated on new subjects. The performance is further improved when dual-channel EEG·EOG are used, reaching an accuracy of 88.8%, a Cohen's kappa of 0.852, and a macro F1-score of 85.8%. At the same time, the ensemble models lead to reduced predictive uncertainty. The results also show that the studied algorithms and their ensembles are robust to concept drift when the training and test data were recorded 7-months apart and after clinical intervention. Detailed analyses further demonstrate "almost perfect" agreement between the automatic stagers to one another and their similar patterns on the staging errors.
    Experience Report: Deep Learning-based System Log Analysis for Anomaly Detection. (arXiv:2107.05908v2 [cs.SE] UPDATED)
    (2 min) Logs have been an imperative resource to ensure the reliability and continuity of many software systems, especially large-scale distributed systems. They faithfully record runtime information to facilitate system troubleshooting and behavior understanding. Due to the large scale and complexity of modern software systems, the volume of logs has reached an unprecedented level. Consequently, for log-based anomaly detection, conventional manual inspection methods or even traditional machine learning-based methods become impractical, which serve as a catalyst for the rapid development of deep learning-based solutions. However, there is currently a lack of rigorous comparison among the representative log-based anomaly detectors that resort to neural networks. Moreover, the re-implementation process demands non-trivial efforts, and bias can be easily introduced. To better understand the characteristics of different anomaly detectors, in this paper, we provide a comprehensive review and evaluation of five popular neural networks used by six state-of-the-art methods. Particularly, four of the selected methods are unsupervised, and the remaining two are supervised. These methods are evaluated with two publicly available log datasets, which contain nearly 16 million log messages and 0.4 million anomaly instances in total. We believe our work can serve as a basis in this field and contribute to future academic research and industrial applications.
    Towards Real-World Deployment of Reinforcement Learning for Traffic Signal Control. (arXiv:2103.16223v3 [cs.LG] UPDATED)
    (2 min) Sub-optimal control policies in intersection traffic signal controllers (TSC) contribute to congestion and lead to negative effects on human health and the environment. Reinforcement learning (RL) for traffic signal control is a promising approach to design better control policies and has attracted considerable research interest in recent years. However, most work done in this area used simplified simulation environments of traffic scenarios to train RL-based TSC. To deploy RL in real-world traffic systems, the gap between simplified simulation environments and real-world applications has to be closed. Therefore, we propose LemgoRL, a benchmark tool to train RL agents as TSC in a realistic simulation environment of Lemgo, a medium-sized town in Germany. In addition to the realistic simulation model, LemgoRL encompasses a traffic signal logic unit that ensures compliance with all regulatory and safety requirements. LemgoRL offers the same interface as the well-known OpenAI Gym toolkit to enable easy deployment in existing research work. To demonstrate the functionality and applicability of LemgoRL, we train a state-of-the-art Deep RL algorithm on a CPU cluster utilizing a framework for distributed and parallel RL and compare its performance with other methods. Our benchmark tool drives the development of RL algorithms towards real-world applications.
    KNODE-MPC: A Knowledge-based Data-driven Predictive Control Framework for Aerial Robots. (arXiv:2109.04821v3 [cs.RO] UPDATED)
    (2 min) In this work, we consider the problem of deriving and incorporating accurate dynamic models for model predictive control (MPC) with an application to quadrotor control. MPC relies on precise dynamic models to achieve the desired closed-loop performance. However, the presence of uncertainties in complex systems and the environments they operate in poses a challenge in obtaining sufficiently accurate representations of the system dynamics. In this work, we make use of a deep learning tool, knowledge-based neural ordinary differential equations (KNODE), to augment a model obtained from first principles. The resulting hybrid model encompasses both a nominal first-principle model and a neural network learnt from simulated or real-world experimental data. Using a quadrotor, we benchmark our hybrid model against a state-of-the-art Gaussian Process (GP) model and show that the hybrid model provides more accurate predictions of the quadrotor dynamics and is able to generalize beyond the training data. To improve closed-loop performance, the hybrid model is integrated into a novel MPC framework, known as KNODE-MPC. Results show that the integrated framework achieves 60.2% improvement in simulations and more than 21% in physical experiments, in terms of trajectory tracking performance.
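    The core idea admits a compact sketch (assumptions mine; a tanh term stands in for the trained network): the hybrid dynamics are the nominal first-principles model plus a learned residual, integrated as a single ODE that the MPC can roll out.

        import numpy as np
        from scipy.integrate import solve_ivp

        def f_nominal(t, x):
            # toy first-principles model: a damped oscillator
            pos, vel = x
            return np.array([vel, -2.0 * pos - 0.1 * vel])

        def f_learned(x, w):
            # stand-in for the trained neural network capturing unmodelled effects
            return w * np.tanh(x)

        def f_hybrid(t, x, w=np.array([0.0, 0.05])):
            return f_nominal(t, x) + f_learned(x, w)

        sol = solve_ivp(f_hybrid, (0.0, 5.0), y0=[1.0, 0.0])
        print(sol.y[:, -1])   # hybrid-model state prediction, usable inside the MPC loop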
    GemNet: Universal Directional Graph Neural Networks for Molecules. (arXiv:2106.08903v6 [physics.comp-ph] UPDATED)
    (2 min) Effectively predicting molecular interactions has the potential to accelerate molecular dynamics by multiple orders of magnitude and thus revolutionize chemical simulations. Graph neural networks (GNNs) have recently shown great successes for this task, overtaking classical methods based on fixed molecular kernels. However, they still appear very limited from a theoretical perspective, since regular GNNs cannot distinguish certain types of graphs. In this work we close this gap between theory and practice. We show that GNNs with directed edge embeddings and two-hop message passing are indeed universal approximators for predictions that are invariant to translation, and equivariant to permutation and rotation. We then leverage these insights and multiple structural improvements to propose the geometric message passing neural network (GemNet). We demonstrate the benefits of the proposed changes in multiple ablation studies. GemNet outperforms previous models on the COLL, MD17, and OC20 datasets by 34%, 41%, and 20%, respectively, and performs especially well on the most challenging molecules. Our implementation is available online.
    Meta-Learning Reliable Priors in the Function Space. (arXiv:2106.03195v2 [cs.LG] UPDATED)
    (2 min) When data are scarce, meta-learning can improve a learner's accuracy by harnessing previous experience from related learning tasks. However, existing methods have unreliable uncertainty estimates which are often overconfident. Addressing these shortcomings, we introduce a novel meta-learning framework, called F-PACOH, that treats meta-learned priors as stochastic processes and performs meta-level regularization directly in the function space. This allows us to directly steer the probabilistic predictions of the meta-learner towards high epistemic uncertainty in regions of insufficient meta-training data and, thus, obtain well-calibrated uncertainty estimates. Finally, we showcase how our approach can be integrated with sequential decision making, where reliable uncertainty quantification is imperative. In our benchmark study on meta-learning for Bayesian Optimization (BO), F-PACOH significantly outperforms all other meta-learners and standard baselines.
    MED-TEX: Transferring and Explaining Knowledge with Less Data from Pretrained Medical Imaging Models. (arXiv:2008.02593v2 [cs.CV] UPDATED)
    (2 min) Deep learning methods usually require a large amount of training data and lack interpretability. In this paper, we propose a novel knowledge distillation and model interpretation framework for medical image classification that jointly solves the above two issues. Specifically, to address the data-hungry issue, a small student model is learned with less data by distilling knowledge from a cumbersome pretrained teacher model. To interpret the teacher model and assist the learning of the student, an explainer module is introduced to highlight the regions of an input that are important for the predictions of the teacher model. Furthermore, the joint framework is trained in a principled way derived from the information-theoretic perspective. Our framework outperforms state-of-the-art methods on the knowledge distillation and model interpretation tasks on a fundus dataset.
    Do We Exploit all Information for Counterfactual Analysis? Benefits of Factor Models and Idiosyncratic Correction. (arXiv:2011.03996v3 [econ.EM] UPDATED)
    (2 min) Optimal pricing, i.e., determining the price level that maximizes profit or revenue of a given product, is a vital task for the retail industry. To select such a quantity, one needs first to estimate the price elasticity from the product demand. Regression methods usually fail to recover such elasticities due to confounding effects and price endogeneity. Therefore, randomized experiments are typically required. However, elasticities can be highly heterogeneous depending on the location of stores, for example. As the randomization frequently occurs at the municipal level, standard difference-in-differences methods may also fail. Possible solutions are based on methodologies to measure the effects of treatments on a single (or just a few) treated unit(s) based on counterfactuals constructed from artificial controls. For example, for each city in the treatment group, a counterfactual may be constructed from the untreated locations. In this paper, we apply a novel high-dimensional statistical method to measure the effects of price changes on daily sales from a major retailer in Brazil. The proposed methodology combines principal components (factors) and sparse regressions, resulting in a method called Factor-Adjusted Regularized Method for Treatment evaluation (FarmTreat). The data consist of daily sales and prices of five different products over more than 400 municipalities. The products considered belong to the "sweet and candies" category and experiments have been conducted over the years of 2016 and 2017. Our results confirm the hypothesis of a high degree of heterogeneity yielding very different pricing strategies over distinct municipalities.
    FjORD: Fair and Accurate Federated Learning under heterogeneous targets with Ordered Dropout. (arXiv:2102.13451v5 [cs.LG] UPDATED)
    (2 min) Federated Learning (FL) has been gaining significant traction across different ML tasks, ranging from vision to keyboard predictions. In large-scale deployments, client heterogeneity is a fact and constitutes a primary problem for fairness, training performance and accuracy. Although significant efforts have been made into tackling statistical data heterogeneity, the diversity in the processing capabilities and network bandwidth of clients, termed as system heterogeneity, has remained largely unexplored. Current solutions either disregard a large portion of available devices or set a uniform limit on the model's capacity, restricted by the least capable participants. In this work, we introduce Ordered Dropout, a mechanism that achieves an ordered, nested representation of knowledge in deep neural networks (DNNs) and enables the extraction of lower footprint submodels without the need of retraining. We further show that for linear maps our Ordered Dropout is equivalent to SVD. We employ this technique, along with a self-distillation methodology, in the realm of FL in a framework called FjORD. FjORD alleviates the problem of client system heterogeneity by tailoring the model width to the client's capabilities. Extensive evaluation on both CNNs and RNNs across diverse modalities shows that FjORD consistently leads to significant performance gains over state-of-the-art baselines, while maintaining its nested structure.
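    A minimal sketch of Ordered Dropout (mine, not the FjORD code): a width fraction p keeps only the first ceil(p * d) hidden units, so smaller submodels are nested inside larger ones and can be extracted without retraining.

        import math
        import torch
        import torch.nn as nn

        class OrderedDropoutMLP(nn.Module):
            def __init__(self, d_in=16, d_hidden=64, d_out=4):
                super().__init__()
                self.fc1 = nn.Linear(d_in, d_hidden)
                self.fc2 = nn.Linear(d_hidden, d_out)

            def forward(self, x, p=1.0):
                k = math.ceil(p * self.fc1.out_features)   # keep the first k units only
                h = torch.relu(self.fc1(x))
                mask = torch.zeros_like(h)
                mask[:, :k] = 1.0
                return self.fc2(h * mask)

        net, x = OrderedDropoutMLP(), torch.randn(2, 16)
        # p is sampled per step during training; a weak client deploys with a small p
        for p in (0.25, 0.5, 1.0):
            print(p, net(x, p=p).shape)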
    MICo: Improved representations via sampling-based state similarity for Markov decision processes. (arXiv:2106.08229v3 [cs.LG] UPDATED)
    (2 min) We present a new behavioural distance over the state space of a Markov decision process, and demonstrate the use of this distance as an effective means of shaping the learnt representations of deep reinforcement learning agents. While existing notions of state similarity are typically difficult to learn at scale due to high computational cost and lack of sample-based algorithms, our newly-proposed distance addresses both of these issues. In addition to providing detailed theoretical analysis, we provide empirical evidence that learning this distance alongside the value function yields structured and informative representations, including strong results on the Arcade Learning Environment benchmark.
    Learning to Maximize Speech Quality Directly Using MOS Prediction for Neural Text-to-Speech. (arXiv:2011.01174v4 [eess.AS] UPDATED)
    (2 min) Although recent neural text-to-speech (TTS) systems have achieved high-quality speech synthesis, there are cases where a TTS system generates low-quality speech, mainly caused by limited training data or information loss during knowledge distillation. Therefore, we propose a novel method to improve speech quality by training a TTS model under the supervision of perceptual loss, which measures the distance between the maximum possible speech quality score and the predicted one. We first pre-train a mean opinion score (MOS) prediction model and then train a TTS model to maximize the MOS of synthesized speech using the pre-trained MOS prediction model. The proposed method can be applied universally (i.e., regardless of the TTS model architecture or the cause of speech quality degradation) and efficiently (i.e., without increasing the inference time or model complexity). The evaluation results for the MOS and phone error rate demonstrate that our proposed approach improves previous models in terms of both naturalness and intelligibility.
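    The objective reduces to a simple form, sketched here under my own assumptions (a linear stand-in for the pre-trained MOS predictor and a maximum score of 5): the TTS loss adds the gap between the best possible MOS and the predicted MOS, with gradients flowing through the frozen predictor into the TTS model.

        import torch
        import torch.nn as nn

        MAX_MOS = 5.0

        class MOSPredictor(nn.Module):
            """Stand-in for the pre-trained MOS model (frozen during TTS training)."""
            def __init__(self, n_mels=80):
                super().__init__()
                self.head = nn.Linear(n_mels, 1)

            def forward(self, mel):   # mel: (batch, frames, n_mels)
                return self.head(mel.mean(dim=1)).squeeze(-1)   # one MOS per utterance

        def tts_loss(pred_mel, target_mel, mos_model, alpha=0.1):
            recon = nn.functional.l1_loss(pred_mel, target_mel)
            perceptual = (MAX_MOS - mos_model(pred_mel)).mean()  # distance to best score
            return recon + alpha * perceptual

        mos_model = MOSPredictor()
        for p in mos_model.parameters():
            p.requires_grad_(False)            # frozen: only the TTS model is updated
        pred = torch.randn(2, 100, 80, requires_grad=True)
        target = torch.randn(2, 100, 80)
        tts_loss(pred, target, mos_model).backward()
        print(pred.grad is not None)           # True: perceptual gradient reaches the TTS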
    Multimodal Representations Learning Based on Mutual Information Maximization and Minimization and Identity Embedding for Multimodal Sentiment Analysis. (arXiv:2201.03969v1 [cs.LG])
    (2 min) Multimodal sentiment analysis (MSA) is a fundamental complex research problem due to the heterogeneity gap between different modalities and the ambiguity of human emotional expression. Although there have been many successful attempts to construct multimodal representations for MSA, there are still two challenges to be addressed: 1) A more robust multimodal representation needs to be constructed to bridge the heterogeneity gap and cope with the complex multimodal interactions, and 2) the contextual dynamics must be modeled effectively throughout the information flow. In this work, we propose a multimodal representation model based on Mutual information Maximization and Minimization and Identity Embedding (MMMIE). We combine mutual information maximization between modal pairs, and mutual information minimization between input data and corresponding features to mine the modal-invariant and task-related information. Furthermore, Identity Embedding is proposed to prompt the downstream network to perceive the contextual information. Experimental results on two public datasets demonstrate the effectiveness of the proposed model.
    STFL: A Temporal-Spatial Federated Learning Framework for Graph Neural Networks. (arXiv:2111.06750v2 [cs.LG] UPDATED)
    (2 min) We present a spatial-temporal federated learning framework for graph neural networks, namely STFL. The framework explores the underlying correlation of the input spatial-temporal data and transforms it into both node features and an adjacency matrix. The federated learning setting in the framework ensures data privacy while achieving good model generalization. Experimental results on the sleep stage dataset, ISRUC_S3, illustrate the effectiveness of STFL on graph prediction tasks.
    rVAD: An Unsupervised Segment-Based Robust Voice Activity Detection Method. (arXiv:1906.03588v2 [cs.SD] UPDATED)
    (3 min) This paper presents an unsupervised segment-based method for robust voice activity detection (rVAD). The method consists of two passes of denoising followed by a voice activity detection (VAD) stage. In the first pass, high-energy segments in a speech signal are detected by using a posteriori signal-to-noise ratio (SNR) weighted energy difference and if no pitch is detected within a segment, the segment is considered as a high-energy noise segment and set to zero. In the second pass, the speech signal is denoised by a speech enhancement method, for which several methods are explored. Next, neighbouring frames with pitch are grouped together to form pitch segments, and based on speech statistics, the pitch segments are further extended from both ends in order to include both voiced and unvoiced sounds and likely non-speech parts as well. In the end, a posteriori SNR weighted energy difference is applied to the extended pitch segments of the denoised speech signal for detecting voice activity. We evaluate the VAD performance of the proposed method using two databases, RATS and Aurora-2, which contain a large variety of noise conditions. The rVAD method is further evaluated, in terms of speaker verification performance, on the RedDots 2016 challenge database and its noise-corrupted versions. Experiment results show that rVAD is compared favourably with a number of existing methods. In addition, we present a modified version of rVAD where computationally intensive pitch extraction is replaced by computationally efficient spectral flatness calculation. The modified version significantly reduces the computational complexity at the cost of moderately inferior VAD performance, which is an advantage when processing a large amount of data and running on low resource devices. The source code of rVAD is made publicly available.
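    For reference, the spectral-flatness measure that the modified version substitutes for pitch extraction is the ratio of the geometric to the arithmetic mean of the power spectrum: close to zero for tonal (voiced) frames and much higher for noise-like frames. A short sketch:

        import numpy as np

        def spectral_flatness(frame, eps=1e-10):
            power = np.abs(np.fft.rfft(frame)) ** 2 + eps
            return np.exp(np.mean(np.log(power))) / np.mean(power)

        t = np.arange(512) / 16000.0
        print(spectral_flatness(np.sin(2 * np.pi * 220 * t)))   # tonal: near 0
        print(spectral_flatness(np.random.randn(512)))          # noise: much higher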
    A general class of surrogate functions for stable and efficient reinforcement learning. (arXiv:2108.05828v2 [cs.LG] UPDATED)
    (2 min) Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple bandit problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite.
    Bootstrapping Informative Graph Augmentation via A Meta Learning Approach. (arXiv:2201.03812v1 [cs.LG])
    (2 min) Recent works explore learning graph representations in a self-supervised manner. In graph contrastive learning, benchmark methods apply various graph augmentation approaches. However, most of the augmentation methods are non-learnable, which causes the issue of generating unbeneficial augmented graphs. Such augmentation may degenerate the representation ability of graph contrastive learning methods. Therefore, we design our method to generate augmented graphs with a learnable graph augmenter, called MEta Graph Augmentation (MEGA). We then clarify that a "good" graph augmentation must have uniformity at the instance-level and informativeness at the feature-level. To this end, we propose a novel approach to learning a graph augmenter that can generate an augmentation with uniformity and informativeness. The objective of the graph augmenter is to promote our feature extraction network to learn a more discriminative feature representation, which motivates us to propose a meta-learning paradigm. Empirically, the experiments across multiple benchmark datasets demonstrate that MEGA outperforms the state-of-the-art methods in graph self-supervised learning tasks. Further experimental studies prove the effectiveness of different terms of MEGA.
    Causal Inference in medicine and in health policy, a summary. (arXiv:2105.04655v4 [cs.LG] UPDATED)
    (2 min) A data science task can be deemed as making sense of the data or testing a hypothesis about it. The conclusions inferred from data can greatly guide us to make informative decisions. Big data has enabled us to carry out countless prediction tasks in conjunction with machine learning, such as identifying high-risk patients suffering from a certain disease and taking preventable measures. However, healthcare practitioners are not content with mere predictions - they are also interested in the cause-effect relation between input features and clinical outcomes. Understanding such relations will help doctors treat patients and reduce the risk effectively. Causality is typically identified by randomized controlled trials. Often such trials are not feasible, in which case scientists and researchers turn to observational studies and attempt to draw inferences. However, observational studies may also be affected by selection and/or confounding biases that can result in wrong causal conclusions. In this chapter, we will try to highlight some of the drawbacks that may arise in traditional machine learning and statistical approaches to analyze the observational data, particularly in the healthcare data analytics domain. We will discuss causal inference and ways to discover the cause-effect from observational studies in the healthcare domain. Moreover, we will demonstrate the applications of causal inference in tackling some common machine learning issues such as missing data and model transportability. Finally, we will discuss the possibility of integrating reinforcement learning with causality as a way to counter confounding bias.
    The Ridgelet Prior: A Covariance Function Approach to Prior Specification for Bayesian Neural Networks. (arXiv:2010.08488v4 [stat.ML] UPDATED)
    (2 min) Bayesian neural networks attempt to combine the strong predictive performance of neural networks with formal quantification of uncertainty associated with the predictive output in the Bayesian framework. However, it remains unclear how to endow the parameters of the network with a prior distribution that is meaningful when lifted into the output space of the network. A possible solution is proposed that enables the user to posit an appropriate Gaussian process covariance function for the task at hand. Our approach constructs a prior distribution for the parameters of the network, called a ridgelet prior, that approximates the posited Gaussian process in the output space of the network. In contrast to existing work on the connection between neural networks and Gaussian processes, our analysis is non-asymptotic, with finite sample-size error bounds provided. This establishes the universality property that a Bayesian neural network can approximate any Gaussian process whose covariance function is sufficiently regular. Our experimental assessment is limited to a proof-of-concept, where we demonstrate that the ridgelet prior can out-perform an unstructured prior on regression problems for which a suitable Gaussian process prior can be provided.
    Sentiment Analysis with Deep Learning Models: A Comparative Study on a Decade of Sinhala Language Facebook Data. (arXiv:2201.03941v1 [cs.CL])
    (2 min) The relationship between Facebook posts and the corresponding reaction feature is an interesting subject to explore and understand. To achieve this end, we test state-of-the-art Sinhala sentiment analysis models against a data set containing a decade's worth of Sinhala posts with millions of reactions. For the purpose of establishing benchmarks and with the goal of identifying the best model for Sinhala sentiment analysis, we also test, on the same data set configuration, other deep learning models catered for sentiment analysis. In this study we report that the 3-layer Bidirectional LSTM model achieves an F1 score of 84.58% for Sinhala sentiment analysis, surpassing the current state-of-the-art model, Capsule B, which only manages an F1 score of 82.04%. Further, since all the deep learning models show F1 scores above 75%, we conclude that it is safe to claim that Facebook reactions are suitable for predicting the sentiment of a text.
    RNNs on Monitoring Physical Activity Energy Expenditure in Older People. (arXiv:2006.01169v2 [eess.SP] UPDATED)
    (3 min) Through the quantification of physical activity energy expenditure (PAEE), health care monitoring has the potential to stimulate vital and healthy ageing, inducing behavioural changes in older people and linking these to personal health gains. To be able to measure PAEE in a monitoring environment, methods from wearable accelerometers have been developed, however, mainly targeted towards younger people. Since elderly subjects differ in energy requirements and range of physical activities, the current models may not be suitable for estimating PAEE among the elderly. Because past activities influence present PAEE, we propose a modeling approach known for its ability to model sequential data, the Recurrent Neural Network (RNN). To train the RNN for an elderly population, we used the GOTOV dataset with 34 healthy participants of 60 years and older (mean 65 years old), performing 16 different activities. We used accelerometers placed on wrist and ankle, and measurements of energy counts by means of indirect calorimetry. After optimization, we propose an architecture consisting of an RNN with 3 GRU layers and a feedforward network combining both accelerometer and participant-level data. In this paper, we describe our efforts to go beyond the standard facilities of a GRU-based RNN, with the aim of achieving accuracy surpassing the state of the art. These efforts include switching the aggregation function from mean to dispersion measures (SD, IQR, ...), combining temporal and static data (person-specific details such as age, weight, BMI) and adding symbolic activity data as predicted by a previously trained ML model. The resulting architecture manages to increase its performance by approximately 10% while decreasing training input by a factor of 10. It can thus be employed to investigate associations of PAEE with vitality parameters related to metabolic and cognitive health and mental well-being.
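    A minimal sketch of the described architecture (dimensions and names are my own guesses): three GRU layers over the accelerometer windows, whose final hidden state is concatenated with the static participant features before a feedforward head predicts PAEE.

        import torch
        import torch.nn as nn

        class PAEENet(nn.Module):
            def __init__(self, n_accel=6, n_static=3, hidden=64):
                super().__init__()
                self.gru = nn.GRU(n_accel, hidden, num_layers=3, batch_first=True)
                self.head = nn.Sequential(
                    nn.Linear(hidden + n_static, 32), nn.ReLU(), nn.Linear(32, 1))

            def forward(self, accel_seq, static):
                _, h = self.gru(accel_seq)                 # h: (layers, batch, hidden)
                fused = torch.cat([h[-1], static], dim=1)  # temporal + static features
                return self.head(fused).squeeze(-1)        # predicted energy expenditure

        net = PAEENet()
        accel = torch.randn(4, 120, 6)    # 4 windows, 120 timesteps, wrist + ankle xyz
        static = torch.randn(4, 3)        # age, weight, BMI (standardized)
        print(net(accel, static).shape)   # torch.Size([4])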
    Hybrid Car-Following Strategy based on Deep Deterministic Policy Gradient and Cooperative Adaptive Cruise Control. (arXiv:2103.03796v2 [cs.AI] UPDATED)
    (2 min) Deep deterministic policy gradient (DDPG)-based car-following strategy can break through the constraints of the differential equation model due to the ability of exploration on complex environments. However, the car-following performance of DDPG is usually degraded by unreasonable reward function design, insufficient training, and low sampling efficiency. In order to solve this kind of problem, a hybrid car-following strategy based on DDPG and cooperative adaptive cruise control (CACC) is proposed. First, the car-following process is modeled as the Markov decision process to calculate CACC and DDPG simultaneously at each frame. Given a current state, two actions are obtained from CACC and DDPG, respectively. Then, an optimal action, corresponding to the one offering a larger reward, is chosen as the output of the hybrid strategy. Meanwhile, a rule is designed to ensure that the change rate of acceleration is smaller than the desired value. Therefore, the proposed strategy not only guarantees the basic performance of car-following through CACC but also makes full use of the advantages of exploration on complex environments via DDPG. Finally, simulation results show that the car-following performance of the proposed strategy is improved compared with that of DDPG and CACC.
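    The selection rule is easy to sketch (my own reading of the abstract; the names and jerk bound are illustrative): score both candidate accelerations with the reward function, keep the better one, then limit the change rate of acceleration.

        import numpy as np

        def hybrid_action(state, ddpg_actor, cacc_controller, reward_fn,
                          prev_accel, max_jerk=2.0, dt=0.1):
            a_rl = ddpg_actor(state)          # learned policy's acceleration
            a_cacc = cacc_controller(state)   # rule-based CACC acceleration
            # pick whichever candidate the reward function scores higher
            a = a_rl if reward_fn(state, a_rl) >= reward_fn(state, a_cacc) else a_cacc
            # bound the change rate of acceleration (ride-comfort rule)
            return np.clip(a, prev_accel - max_jerk * dt, prev_accel + max_jerk * dt)

        a = hybrid_action({"gap": 20.0}, lambda s: 1.5, lambda s: 0.8,
                          lambda s, a: -abs(a), prev_accel=0.5)
        print(a)   # 0.7: CACC wins here, and the jerk limit caps the step at +/-0.2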
    Systematic Literature Review: Quantum Machine Learning and its applications. (arXiv:2201.04093v1 [quant-ph])
    (2 min) Quantum computing is the process of performing calculations using quantum mechanics. This field studies the quantum behavior of certain subatomic particles for subsequent use in performing calculations, as well as for large-scale information processing. These capabilities can give quantum computers an advantage in terms of computational time and cost over classical computers. Nowadays, there are scientific challenges that are impossible to perform by classical computation due to computational complexity or the time the calculation would take, and quantum computation is one of the possible answers. However, current quantum devices do not yet have the necessary qubits and are not fault-tolerant enough to achieve these goals. Nonetheless, there are other fields like machine learning or chemistry where quantum computation could be useful with current quantum devices. This manuscript aims to present a Systematic Literature Review of the papers published between 2017 and 2021 to identify, analyze and classify the different algorithms used in quantum machine learning and their applications. Consequently, this study identified 52 articles that used quantum machine learning techniques and algorithms. The main types of algorithms found are quantum implementations of classical machine learning algorithms, such as support vector machines or the k-nearest neighbor model, and classical deep learning algorithms, like quantum neural networks. Many articles try to solve problems currently answered by classical machine learning but using quantum devices and algorithms. Even though results are promising, quantum machine learning is far from achieving its full potential. An improvement in the quantum hardware is required since the existing quantum computers lack enough quality, speed, and scale to allow quantum computing to achieve its full potential.
    DensE: An Enhanced Non-commutative Representation for Knowledge Graph Embedding with Adaptive Semantic Hierarchy. (arXiv:2008.04548v2 [cs.AI] UPDATED)
    (2 min) Capturing the composition patterns of relations is a vital task in knowledge graph completion. It also serves as a fundamental step towards multi-hop reasoning over learned knowledge. Previously, several rotation-based translational methods have been developed to model composite relations using the product of a series of complex-valued diagonal matrices. However, these methods tend to make several oversimplified assumptions on the composite relations, e.g., forcing them to be commutative, independent from entities and lacking semantic hierarchy. To systematically tackle these problems, we have developed a novel knowledge graph embedding method, named DensE, to provide an improved modeling scheme for the complex composition patterns of relations. In particular, our method decomposes each relation into an SO(3) group-based rotation operator and a scaling operator in the three dimensional (3-D) Euclidean space. This design principle leads to several advantages of our method: (1) For composite relations, the corresponding diagonal relation matrices can be non-commutative, reflecting a predominant scenario in real world applications; (2) Our model preserves the natural interaction between relational operations and entity embeddings; (3) The scaling operation provides the modeling power for the intrinsic semantic hierarchical structure of entities; (4) The enhanced expressiveness of DensE is achieved with high computational efficiency in terms of both parameter size and training time; and (5) Modeling entities in Euclidean space instead of quaternion space keeps the direct geometrical interpretations of relational patterns. Experimental results on multiple benchmark knowledge graphs show that DensE outperforms the current state-of-the-art models for missing link prediction, especially on composite relations.
    Designing ECG Monitoring Healthcare System with Federated Transfer Learning and Explainable AI. (arXiv:2105.12497v2 [cs.LG] UPDATED)
    (2 min) Deep learning plays a vital role in classifying different arrhythmias using electrocardiography (ECG) data. Nevertheless, training deep learning models normally requires a large amount of data and it can lead to privacy concerns. Unfortunately, a large amount of healthcare data cannot be easily collected from a single silo. Additionally, deep learning models are black boxes, with no explainability of the predicted results, which is often required in clinical healthcare. This limits the application of deep learning in real-world health systems. In this paper, we design a new explainable artificial intelligence (XAI) based deep learning framework in a federated setting for ECG-based healthcare applications. The federated setting is used to solve issues such as data availability and privacy concerns. Furthermore, the proposed framework setting effectively classifies arrhythmias using an autoencoder and a classifier, both based on a convolutional neural network (CNN). Additionally, we propose an XAI-based module on top of the proposed classifier to explain the classification results, which helps clinical practitioners make quick and reliable decisions. The proposed framework was trained and tested using the MIT-BIH Arrhythmia database. The classifier achieved accuracy up to 94% and 98% for arrhythmia detection using noisy and clean data, respectively, with five-fold cross-validation.
    Emotion Intensity and its Control for Emotional Voice Conversion. (arXiv:2201.03967v1 [cs.SD])
    (2 min) Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. In EVC, emotions are usually treated as discrete categories overlooking the fact that speech also conveys emotions with various intensity levels that the listener can perceive. In this paper, we aim to explicitly characterize and control the intensity of emotion. We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding. We further learn the actual emotion encoder from an emotion-labelled database and study the use of relative attributes to represent fine-grained emotion intensity. To ensure emotional intelligibility, we incorporate emotion classification loss and emotion embedding similarity loss into the training of the EVC network. As desired, the proposed network controls the fine-grained emotion intensity in the output speech. Through both objective and subjective evaluations, we validate the effectiveness of the proposed network for emotional expressiveness and emotion intensity control.
    quantum Case-Based Reasoning (qCBR). (arXiv:2104.00409v2 [cs.AI] UPDATED)
    (2 min) Case-Based Reasoning (CBR) is an artificial intelligence approach to problem-solving with a good record of success. This article proposes using Quantum Computing to improve some of the key processes of CBR, such that a Quantum Case-Based Reasoning (qCBR) paradigm can be defined. The focus is set on designing and implementing a qCBR based on the variational principle that improves its classical counterpart in terms of average accuracy, scalability and tolerance to overlapping. A comparative study of the proposed qCBR with a classic CBR is performed for the case of the Social Workers' Problem as a sample of a combinatorial optimization problem with overlapping. The algorithm's quantum feasibility is modelled with docplex and tested on IBMQ computers, with experiments run on the Qibo framework.
    Representation Ensembling for Synergistic Lifelong Learning with Quasilinear Complexity. (arXiv:2004.12908v12 [cs.AI] UPDATED)
    (3 min) In biological learning, data are used to improve performance not only on the current task, but also on previously encountered, and as yet unencountered tasks. In contrast, classical machine learning starts from a blank slate, or tabula rasa, using data only for the single task at hand. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called catastrophic forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain performance given new tasks. But striving to avoid forgetting sets the goal unnecessarily low: the goal of lifelong learning, whether biological or artificial, should be to improve performance on both past and future tasks with any new data. Our key insight is that we can ensemble representations learned independently across tasks to transfer omnidirectionally, that is, jointly improve performance on both future unforeseen tasks (forward transfer) and past tasks (backward transfer). Specifically, we propose two lifelong learners: one ensembling trees and the other ensembling networks. In both cases, the algorithms are able to learn synergistically, improving performance on both past and future tasks in a variety of simulated and real data scenarios, including tabular data, image data, spoken data, and adversarial tasks. Moreover, they can do so with quasilinear space and time complexity, a requirement for any bona fide lifelong learning system.
    Data transformation based optimized customer churn prediction model for the telecommunication industry. (arXiv:2201.04088v1 [cs.LG])
    (2 min) Data transformation (DT) is a process that transfers the original data into a form which supports a particular classification algorithm and helps to analyze the data for a special purpose. To improve the prediction performance we investigated various data transform methods. This study is conducted in a customer churn prediction (CCP) context in the telecommunication industry (TCI), where customer attrition is a common phenomenon. We have proposed a novel approach of combining data transformation methods with the machine learning models for the CCP problem. We conducted our experiments on publicly available TCI datasets and assessed the performance in terms of the widely used evaluation measures (e.g. AUC, precision, recall, and F-measure). In this study, we presented comprehensive comparisons to affirm the effect of the transformation methods. The comparison results and statistical test proved that most of the proposed data transformation based optimized models improve the performance of CCP significantly. Overall, an efficient and optimized CCP model for the telecommunication industry has been presented through this manuscript.
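    A minimal sketch of the recipe (mine; a Box-Cox transform and logistic regression stand in for the paper's transform/model combinations): transform the skewed usage features, fit the churn classifier, and compare AUC against the untransformed baseline.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import PowerTransformer
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score

        X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
        X = np.exp(X / 2)   # make features skewed and positive, like usage data
        Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

        pt = PowerTransformer(method="box-cox")   # the data transformation (DT) step
        Xtr_t, Xte_t = pt.fit_transform(Xtr), pt.transform(Xte)

        for name, tr, te in [("raw", Xtr, Xte), ("box-cox", Xtr_t, Xte_t)]:
            clf = LogisticRegression(max_iter=1000).fit(tr, ytr)
            print(name, round(roc_auc_score(yte, clf.predict_proba(te)[:, 1]), 3))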
    Spectrum Surveying: Active Radio Map Estimation with Autonomous UAVs. (arXiv:2201.04125v1 [eess.SP])
    (2 min) Radio maps find numerous applications in wireless communications and mobile robotics tasks, including resource allocation, interference coordination, and mission planning. Although numerous techniques have been proposed to construct radio maps from spatially distributed measurements, the locations of such measurements are assumed predetermined beforehand. In contrast, this paper proposes spectrum surveying, where a mobile robot such as an unmanned aerial vehicle (UAV) collects measurements at a set of locations that are actively selected to obtain high-quality map estimates in a short surveying time. This is performed in two steps. First, two novel algorithms, a model-based online Bayesian estimator and a data-driven deep learning algorithm, are devised for updating a map estimate and an uncertainty metric that indicates the informativeness of measurements at each possible location. These algorithms offer complementary benefits and feature constant complexity per measurement. Second, the uncertainty metric is used to plan the trajectory of the UAV to gather measurements at the most informative locations. To overcome the combinatorial complexity of this problem, a dynamic programming approach is proposed to obtain lists of waypoints through areas of large uncertainty in linear time. Numerical experiments conducted on a realistic dataset confirm that the proposed scheme constructs accurate radio maps quickly.
    Deep Neural Network Approximation For Hölder Functions. (arXiv:2201.03747v1 [cs.LG])
    (2 min) In this work, we explore the approximation capability of deep Rectified Quadratic Unit neural networks for Hölder-regular functions, with respect to the uniform norm. We find that the theoretical approximation heavily depends on the selected activation function in the neural network.
    M2: Mixed Models with Preferences, Popularities and Transitions for Next-Basket Recommendation. (arXiv:2004.01646v3 [cs.LG] UPDATED)
    (2 min) Next-basket recommendation considers the problem of recommending a set of items into the next basket that users will purchase as a whole. In this paper, we develop a novel mixed model with preferences, popularities and transitions (M2) for the next-basket recommendation. This method models three important factors in next-basket generation process: 1) users' general preferences, 2) items' global popularities and 3) transition patterns among items. Unlike existing recurrent neural network-based approaches, M2 does not use the complicated networks to model the transitions among items, or generate embeddings for users. Instead, it has a simple encoder-decoder based approach (ed-Trans) to better model the transition patterns among items. We compared M2 with different combinations of the factors with 5 state-of-the-art next-basket recommendation methods on 4 public benchmark datasets in recommending the first, second and third next basket. Our experimental results demonstrate that M2 significantly outperforms the state-of-the-art methods on all the datasets in all the tasks, with an improvement of up to 22.1%. In addition, our ablation study demonstrates that the ed-Trans is more effective than recurrent neural networks in terms of the recommendation performance. We also have a thorough discussion on various experimental protocols and evaluation metrics for next-basket recommendation evaluation.
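    A toy rendering of the three factors (my own reading, not the authors' code): score each candidate item as a weighted mixture of the user's general preference, the item's global popularity, and transition affinity from the last basket.

        import numpy as np

        def score_items(user_pref, popularity, transition, last_basket,
                        weights=(0.5, 0.2, 0.3)):
            a, b, c = weights
            trans = transition[last_basket].mean(axis=0)  # items that tend to follow
            return a * user_pref + b * popularity + c * trans

        rng = np.random.default_rng(0)
        user_pref = np.array([0.9, 0.1, 0.3, 0.2, 0.5])   # user's general preferences
        popularity = np.array([0.2, 0.8, 0.4, 0.3, 0.1])  # items' global popularities
        transition = rng.random((5, 5))                   # transition patterns
        scores = score_items(user_pref, popularity, transition, last_basket=[0, 2])
        print(np.argsort(-scores)[:3])   # top-3 items for the next basket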
    DynO: Dynamic Onloading of Deep Neural Networks from Cloud to Device. (arXiv:2104.09949v2 [cs.DC] UPDATED)
    (2 min) Recently, there has been an explosive growth of mobile and embedded applications using convolutional neural networks (CNNs). To alleviate their excessive computational demands, developers have traditionally resorted to cloud offloading, inducing high infrastructure costs and a strong dependence on networking conditions. On the other end, the emergence of powerful SoCs is gradually enabling on-device execution. Nonetheless, low- and mid-tier platforms still struggle to run state-of-the-art CNNs adequately. In this paper, we present DynO, a distributed inference framework that combines the best of both worlds to address several challenges, such as device heterogeneity, varying bandwidth and multi-objective requirements. Key components that enable this are its novel CNN-specific data packing method, which exploits the variability of precision needs in different parts of the CNN when onloading computation, and its novel scheduler that jointly tunes the partition point and transferred data precision at run time to adapt inference to its execution environment. Quantitative evaluation shows that DynO outperforms the current state-of-the-art, improving throughput by over an order of magnitude over device-only execution and up to 7.9x over competing CNN offloading systems, with up to 60x less data transferred.
    Learning to Denoise Raw Mobile UI Layouts for Improving Datasets at Scale. (arXiv:2201.04100v1 [cs.HC])
    (2 min) The layout of a mobile screen is a critical data source for UI design research and semantic understanding of the screen. However, UI layouts in existing datasets are often noisy, have mismatches with their visual representation, or consist of generic or app-specific types that are difficult to analyze and model. In this paper, we propose the CLAY pipeline that uses a deep learning approach for denoising UI layouts, allowing us to automatically improve existing mobile UI layout datasets at scale. Our pipeline takes both the screenshot and the raw UI layout, and annotates the raw layout by removing incorrect nodes and assigning a semantically meaningful type to each node. To experiment with our data-cleaning pipeline, we create the CLAY dataset of 59,555 human-annotated screen layouts, based on screenshots and raw layouts from Rico, a public mobile UI corpus. Our deep models achieve high accuracy with F1 scores of 82.7% for detecting layout objects that do not have a valid visual representation and 85.9% for recognizing object types, which significantly outperforms a heuristic baseline. Our work lays a foundation for creating large-scale high-quality UI layout datasets for data-driven mobile UI research and reduces the need for manual labeling efforts that are prohibitively expensive.
    Automated Reinforcement Learning (AutoRL): A Survey and Open Problems. (arXiv:2201.03916v1 [cs.LG])
    (2 min) The combination of Reinforcement Learning (RL) with deep learning has led to a series of impressive feats, with many believing (deep) RL provides a path towards generally capable agents. However, the success of RL agents is often highly sensitive to design choices in the training process, which may require tedious and error-prone manual tuning. This makes it challenging to use RL for new problems while also limiting its full potential. In many other areas of machine learning, AutoML has shown it is possible to automate such design choices and has also yielded promising initial results when applied to RL. However, Automated Reinforcement Learning (AutoRL) involves not only standard applications of AutoML but also includes additional challenges unique to RL, that naturally produce a different set of methods. As such, AutoRL has been emerging as an important area of research in RL, providing promise in a variety of applications from RNA design to playing games such as Go. Given the diversity of methods and environments considered in RL, much of the research has been conducted in distinct subfields, ranging from meta-learning to evolution. In this survey we seek to unify the field of AutoRL: we provide a common taxonomy, discuss each area in detail, and pose open problems of interest to researchers going forward.
    An Embedding Framework for Consistent Polyhedral Surrogates. (arXiv:1907.07330v2 [cs.LG] UPDATED)
    (2 min) We formalize and study the natural approach of designing convex surrogate loss functions via embeddings, for problems such as classification, ranking, or structured prediction. In this approach, one embeds each of the finitely many predictions (e.g., rankings) as a point in $\mathbb{R}^d$, assigns the original loss values to these points, and "convexifies" the loss in some way to obtain a surrogate. We establish a strong connection between this approach and polyhedral (piecewise-linear convex) surrogate losses. Given any polyhedral loss $L$, we give a construction of a link function through which $L$ is a consistent surrogate for the loss it embeds. Conversely, we show how to construct a consistent polyhedral surrogate for any given discrete loss. Our framework yields succinct proofs of consistency or inconsistency of various polyhedral surrogates in the literature, and for inconsistent surrogates, it further reveals the discrete losses for which these surrogates are consistent. We show some additional structure of embeddings, such as the equivalence of embedding and matching Bayes risks, and the equivalence of various notions of non-redundancy. Using these results, we establish that indirect elicitation, a necessary condition for consistency, is also sufficient when working with polyhedral surrogates.
    ExBrainable: An Open-Source GUI for CNN-based EEG Decoding and Model Interpretation. (arXiv:2201.04065v1 [eess.SP])
    (2 min) We have developed a graphic user interface (GUI), ExBrainable, dedicated to convolutional neural networks (CNN) model training and visualization in electroencephalography (EEG) decoding. Available functions include model training, evaluation, and parameter visualization in terms of temporal and spatial representations. We demonstrate these functions using a well-studied public dataset of motor-imagery EEG and compare the results with existing knowledge of neuroscience. The primary objective of ExBrainable is to provide a fast, simplified, and user-friendly solution of EEG decoding for investigators across disciplines to leverage cutting-edge methods in brain/neuroscience research.
    On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering. (arXiv:2201.03965v1 [cs.CV])
    (2 min) In recent years, multi-modal transformers have shown significant progress in Vision-Language tasks, such as Visual Question Answering (VQA), outperforming previous architectures by a considerable margin. This improvement in VQA is often attributed to the rich interactions between vision and language streams. In this work, we investigate the efficacy of co-attention transformer layers in helping the network focus on relevant regions while answering the question. We generate visual attention maps using the question-conditioned image attention scores in these co-attention layers. We evaluate the effect of the following critical components on visual attention of a state-of-the-art VQA model: (i) number of object region proposals, (ii) question part of speech (POS) tags, (iii) question semantics, (iv) number of co-attention layers, and (v) answer accuracy. We compare the neural network attention maps against human attention maps both qualitatively and quantitatively. Our findings indicate that co-attention transformer modules are crucial in attending to relevant regions of the image given a question. Importantly, we observe that the semantic meaning of the question is not what drives visual attention, but specific keywords in the question do. Our work sheds light on the function and interpretation of co-attention transformer layers, highlights gaps in current networks, and can guide the development of future VQA models and networks that simultaneously process visual and language streams.
    In Defense of the Unitary Scalarization for Deep Multi-Task Learning. (arXiv:2201.04122v1 [cs.LG])
    (2 min) Recent multi-task learning research argues against unitary scalarization, where training simply minimizes the sum of the task losses. Several ad-hoc multi-task optimization algorithms have instead been proposed, inspired by various hypotheses about what makes multi-task settings difficult. The majority of these optimizers require per-task gradients, and introduce significant memory, runtime, and implementation overhead. We present a theoretical analysis suggesting that many specialized multi-task optimizers can be interpreted as forms of regularization. Moreover, we show that, when coupled with standard regularization and stabilization techniques from single-task learning, unitary scalarization matches or improves upon the performance of complex multi-task optimizers in both supervised and reinforcement learning settings. We believe our results call for a critical reevaluation of recent research in the area.
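    For concreteness, here is a minimal PyTorch sketch of unitary scalarization (our illustration, not the authors' code): the multi-task objective is just the sum of per-task losses, so a single backward pass suffices and no per-task gradients need to be stored, which is exactly the overhead the specialized optimizers incur.

        # Unitary scalarization sketch: sum task losses, one backward pass.
        import torch
        import torch.nn as nn

        torch.manual_seed(0)
        shared = nn.Linear(16, 32)                                   # shared trunk
        heads = nn.ModuleList([nn.Linear(32, 1) for _ in range(3)])  # one head per task
        params = list(shared.parameters()) + list(heads.parameters())
        opt = torch.optim.Adam(params, lr=1e-3, weight_decay=1e-4)   # standard regularization

        x = torch.randn(8, 16)
        targets = [torch.randn(8, 1) for _ in range(3)]              # dummy per-task targets

        features = torch.relu(shared(x))
        loss = sum(nn.functional.mse_loss(h(features), t) for h, t in zip(heads, targets))
        opt.zero_grad()
        loss.backward()   # one backward pass; per-task-gradient optimizers need one per task
        opt.step()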
    Application of Common Spatial Patterns in Gravitational Waves Detection. (arXiv:2201.04086v1 [gr-qc])
    (2 min) Common Spatial Patterns (CSP) is a feature extraction algorithm widely used in Brain-Computer Interface (BCI) Systems for detecting Event-Related Potentials (ERPs) in multi-channel magneto/electroencephalography (MEG/EEG) time series data. In this article, we develop and apply a CSP algorithm to the problem of identifying whether a given epoch of multi-detector Gravitational Wave (GW) strains contains coalescences. Paired with Signal Processing techniques and a Logistic Regression classifier, we find that our pipeline is able to correctly detect 76 out of 82 confident events from the Gravitational Wave Transient Catalog, using H1 and L1 strains, with a classification score of $93.72 \pm 0.04\%$ using $10 \times 5$ cross validation. The false negative events were: GW170817-v3, GW191219 163120-v1, GW200115 042309-v2, GW200210 092254-v1, GW200220 061928-v1, and GW200322 091133-v1.
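    For readers unfamiliar with CSP, the following numpy sketch shows the generic two-class CSP feature extractor (the textbook technique the paper adapts; the GW-specific pipeline, detectors and classifier are not reproduced here):

        # Generic two-class CSP via a generalized eigendecomposition.
        import numpy as np
        from scipy.linalg import eigh

        def csp_filters(epochs_a, epochs_b, n_filters=4):
            """epochs_*: arrays of shape (n_epochs, n_channels, n_samples)."""
            cov = lambda e: np.mean([x @ x.T / np.trace(x @ x.T) for x in e], axis=0)
            ca, cb = cov(epochs_a), cov(epochs_b)
            # Generalized eigenproblem: ca w = lambda (ca + cb) w
            vals, vecs = eigh(ca, ca + cb)
            order = np.argsort(vals)                       # small and large eigenvalues
            picks = np.r_[order[:n_filters // 2], order[-n_filters // 2:]]
            return vecs[:, picks].T                        # (n_filters, n_channels)

        def csp_features(epochs, filters):
            z = np.einsum('fc,ecs->efs', filters, epochs)  # spatially filtered signals
            var = z.var(axis=2)
            return np.log(var / var.sum(axis=1, keepdims=True))  # log-variance features

        rng = np.random.default_rng(0)
        a = rng.standard_normal((20, 2, 256)) * np.array([2.0, 1.0])[:, None]
        b = rng.standard_normal((20, 2, 256)) * np.array([1.0, 2.0])[:, None]
        w = csp_filters(a, b, n_filters=2)
        print(csp_features(a, w).shape)  # (20, 2): inputs for e.g. logistic regression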
    Improving ECG Classification Interpretability using Saliency Maps. (arXiv:2201.04070v1 [eess.SP])
    (2 min) Cardiovascular disease is a large worldwide healthcare issue; symptoms often present suddenly with minimal warning. The electrocardiogram (ECG) is a fast, simple and reliable method of evaluating the health of the heart, by measuring electrical activity recorded through electrodes placed on the skin. ECGs often need to be analyzed by a cardiologist, taking time which could be spent on improving patient care and outcomes. Because of this, automatic ECG classification systems using machine learning have been proposed, which can learn complex interactions between ECG features and use this to detect abnormalities. However, algorithms built for this purpose often fail to generalize well to unseen data, reporting initially impressive results which drop dramatically when applied to new environments. Additionally, machine learning algorithms suffer a "black-box" issue, in which it is difficult to determine how a decision has been made. This is vital for applications in healthcare, as clinicians need to be able to verify the process of evaluation in order to trust the algorithm. This paper proposes a method for visualizing model decisions across each class in the MIT-BIH arrhythmia dataset, using adapted saliency maps averaged across complete classes to determine what patterns are being learned. We do this by building two algorithms based on state-of-the-art models. This paper highlights how these maps can be used to find problems in the model which could be affecting generalizability and model performance. Comparing saliency maps across complete classes gives an overall impression of confounding variables or other biases in the model, unlike what would be highlighted when comparing saliency maps on an ECG-by-ECG basis.
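    As an illustration of the underlying idea, here is a sketch of class-averaged gradient saliency for a 1D signal classifier; the network below is a hypothetical stand-in, not one of the paper's two models:

        # Class-averaged input-gradient saliency for a 1D-CNN classifier.
        import torch

        def class_mean_saliency(model, beats):
            """beats: (n, 1, length) tensor of ECG beats from one class."""
            model.eval()
            beats = beats.clone().requires_grad_(True)
            scores = model(beats)                      # (n, n_classes) logits
            scores.max(dim=1).values.sum().backward()  # gradient of top score w.r.t. input
            return beats.grad.abs().mean(dim=0)        # (1, length) class-averaged saliency

        # Usage with a hypothetical stand-in model:
        model = torch.nn.Sequential(torch.nn.Conv1d(1, 8, 7, padding=3),
                                    torch.nn.ReLU(),
                                    torch.nn.AdaptiveAvgPool1d(1),
                                    torch.nn.Flatten(),
                                    torch.nn.Linear(8, 5))
        sal = class_mean_saliency(model, torch.randn(32, 1, 360))
        print(sal.shape)  # torch.Size([1, 360]); plot this per class to spot biases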
    Image quality measurements and denoising using Fourier Ring Correlations. (arXiv:2201.03992v1 [eess.IV])
    (2 min) Image quality is a nebulous concept with different meanings to different people. To quantify image quality a relative difference is typically calculated between a corrupted image and a ground truth image. But what metric should we use for measuring this difference? Ideally, the metric should perform well for both natural and scientific images. The structural similarity index (SSIM) is a good measure for how humans perceive image similarities, but is not sensitive to differences that are scientifically meaningful in microscopy. In electron and super-resolution microscopy, the Fourier Ring Correlation (FRC) is often used, but is little known outside of these fields. Here we show that the FRC can equally well be applied to natural images, e.g. the Google Open Images dataset. We then define a loss function based on the FRC, show that it is analytically differentiable, and use it to train a U-net for denoising of images. This FRC-based loss function allows the network to train faster and achieve similar or better results than when using L1- or L2- based losses. We also investigate the properties and limitations of neural network denoising with the FRC analysis.
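    A minimal numpy sketch of the FRC itself (the plain measure, not the paper's differentiable loss) may help: the two spectra are correlated ring by ring in Fourier space:

        # Fourier Ring Correlation between two images, ring by ring.
        import numpy as np

        def frc(img1, img2, n_rings=32):
            f1 = np.fft.fftshift(np.fft.fft2(img1))
            f2 = np.fft.fftshift(np.fft.fft2(img2))
            h, w = img1.shape
            y, x = np.indices((h, w))
            r = np.hypot(y - h / 2, x - w / 2)
            edges = np.linspace(0, r.max(), n_rings + 1)
            ring = np.digitize(r, edges) - 1
            out = np.empty(n_rings)
            for k in range(n_rings):
                m = ring == k
                num = np.abs(np.sum(f1[m] * np.conj(f2[m])))
                den = np.sqrt(np.sum(np.abs(f1[m]) ** 2) * np.sum(np.abs(f2[m]) ** 2))
                out[k] = num / den if den > 0 else 0.0
            return out  # correlation per ring, from low to high spatial frequency

        clean = np.random.default_rng(0).standard_normal((64, 64))
        noisy = clean + 0.5 * np.random.default_rng(1).standard_normal((64, 64))
        print(frc(clean, noisy)[:4])  # low-frequency rings correlate most strongly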
    A hybrid estimation of distribution algorithm for joint stratification and sample allocation. (arXiv:2201.04068v1 [stat.ME])
    (2 min) In this study we propose a hybrid estimation of distribution algorithm (HEDA) to solve the joint stratification and sample allocation problem. This is a complex problem in which the quality of each stratification from the set of all possible stratifications is measured by its optimal sample allocation. EDAs are stochastic black-box optimization algorithms which can be used to estimate, build and sample probability models in the search for an optimal stratification. In this paper we enhance the exploitation properties of the EDA by adding a simulated annealing algorithm to make it a hybrid EDA. Results of empirical comparisons for atomic and continuous strata show that the HEDA attains the best results found so far when compared to benchmark tests on the same data using a grouping genetic algorithm, simulated annealing algorithm or hill-climbing algorithm. However, the execution times and total execution time are, in general, higher for the HEDA.
    Differentially Describing Groups of Graphs. (arXiv:2201.04064v1 [cs.SI])
    (2 min) How does neural connectivity in autistic children differ from neural connectivity in healthy children or autistic youths? What patterns in global trade networks are shared across classes of goods, and how do these patterns change over time? Answering questions like these requires us to differentially describe groups of graphs: Given a set of graphs and a partition of these graphs into groups, discover what graphs in one group have in common, how they systematically differ from graphs in other groups, and how multiple groups of graphs are related. We refer to this task as graph group analysis, which seeks to describe similarities and differences between graph groups by means of statistically significant subgraphs. To perform graph group analysis, we introduce Gragra, which uses maximum entropy modeling to identify a non-redundant set of subgraphs with statistically significant associations to one or more graph groups. Through an extensive set of experiments on a wide range of synthetic and real-world graph groups, we confirm that Gragra works well in practice.
    A novel method for error analysis in radiation thermometry with application to industrial furnaces. (arXiv:2201.04069v1 [eess.SP])
    (2 min) Accurate temperature measurements are essential for the proper monitoring and control of industrial furnaces. However, measurement uncertainty is a risk for such a critical parameter. Certain instrumental and environmental errors must be considered when using spectral-band radiation thermometry techniques, such as the uncertainty in the emissivity of the target surface, reflected radiation from surrounding objects, or atmospheric absorption and emission, to name a few. Undesired contributions to measured radiation can be isolated using measurement models, also known as error-correction models. This paper presents a methodology for budgeting significant sources of error and uncertainty during temperature measurements in a petrochemical furnace scenario. A continuous monitoring system is also presented, aided by a deep-learning-based measurement correction model, to allow domain experts to analyze the furnace's operation in real-time. To validate the proposed system's functionality, a real-world application case in a petrochemical plant is presented. The proposed solution demonstrates the viability of precise industrial furnace monitoring, thereby increasing operational security and improving the efficiency of such energy-intensive systems.
    Training-Free Uncertainty Estimation for Dense Regression: Sensitivity as a Surrogate. (arXiv:1910.04858v3 [cs.CV] UPDATED)
    (2 min) Uncertainty estimation is an essential step in the evaluation of the robustness for deep learning models in computer vision, especially when applied in risk-sensitive areas. However, most state-of-the-art deep learning models either fail to obtain uncertainty estimation or need significant modification (e.g., formulating a proper Bayesian treatment) to obtain it. Most previous methods are not able to take an arbitrary model off the shelf and generate uncertainty estimation without retraining or redesigning it. To address this gap, we perform a systematic exploration into training-free uncertainty estimation for dense regression, an unrecognized yet important problem, and provide a theoretical construction justifying such estimations. We propose three simple and scalable methods to analyze the variance of outputs from a trained network under tolerable perturbations: infer-transformation, infer-noise, and infer-dropout. They operate solely during the inference, without the need to re-train, re-design, or fine-tune the models, as typically required by state-of-the-art uncertainty estimation methods. Surprisingly, even without involving such perturbations in training, our methods produce comparable or even better uncertainty estimation when compared to training-required state-of-the-art methods.
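    The following PyTorch sketch illustrates one of the three proposed perturbations, infer-noise, as described in the abstract (the shapes and the stand-in network are our assumptions):

        # Training-free uncertainty via inference-time input perturbations.
        import torch

        @torch.no_grad()
        def infer_noise_uncertainty(model, x, n_samples=16, sigma=0.02):
            """x: (batch, ...) input. Returns per-pixel mean and variance of outputs."""
            model.eval()
            outs = torch.stack([model(x + sigma * torch.randn_like(x))
                                for _ in range(n_samples)])
            return outs.mean(dim=0), outs.var(dim=0)   # variance as uncertainty map

        # Usage with a hypothetical dense-regression network:
        net = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1),
                                  torch.nn.ReLU(),
                                  torch.nn.Conv2d(8, 1, 3, padding=1))
        mean, var = infer_noise_uncertainty(net, torch.randn(2, 3, 32, 32))
        print(mean.shape, var.shape)  # both (2, 1, 32, 32)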
    End-To-End Optimization of LiDAR Beam Configuration for 3D Object Detection and Localization. (arXiv:2201.03860v1 [cs.RO])
    (2 min) Existing learning methods for LiDAR-based applications use 3D points scanned under a pre-determined beam configuration, e.g., the elevation angles of beams are often evenly distributed. Those fixed configurations are task-agnostic, so simply using them can lead to sub-optimal performance. In this work, we take a new route to learn to optimize the LiDAR beam configuration for a given application. Specifically, we propose a reinforcement learning-based learning-to-optimize (RL-L2O) framework to automatically optimize the beam configuration in an end-to-end manner for different LiDAR-based applications. The optimization is guided by the final performance of the target task and thus our method can be integrated easily with any LiDAR-based application as a simple drop-in module. The method is especially useful when a low-resolution (low-cost) LiDAR is needed, for instance, for system deployment at a massive scale. We use our method to search for the beam configuration of a low-resolution LiDAR for two important tasks: 3D object detection and localization. Experiments show that the proposed RL-L2O method improves the performance in both tasks significantly compared to the baseline methods. We believe that a combination of our method with the recent advances of programmable LiDARs can start a new research direction for LiDAR-based active perception. The code is publicly available at https://github.com/vniclas/lidar_beam_selection
    Reward Relabelling for combined Reinforcement and Imitation Learning on sparse-reward tasks. (arXiv:2201.03834v1 [cs.LG])
    (2 min) During recent years, deep reinforcement learning (DRL) has made successful incursions into complex decision-making applications such as robotics, autonomous driving or video games. In the search for more sample-efficient algorithms, a promising direction is to leverage as much external off-policy data as possible. One staple of this data-driven approach is to learn from expert demonstrations. In the past, multiple ideas have been proposed to make good use of the demonstrations added to the replay buffer, such as pretraining on demonstrations only or minimizing additional cost functions. We present a new method, able to leverage demonstrations and episodes collected online in any sparse-reward environment with any off-policy algorithm. Our method is based on a reward bonus given to demonstrations and successful episodes, encouraging expert imitation and self-imitation. First, we give a reward bonus to the transitions coming from demonstrations to encourage the agent to match the demonstrated behaviour. Then, upon collecting a successful episode, we relabel its transitions with the same bonus before adding them to the replay buffer, encouraging the agent to also match its previous successes. Our experiments focus on manipulation robotics, specifically on three tasks for a 6 degrees-of-freedom robotic arm in simulation. We show that our method based on reward relabeling improves the performance of the base algorithm (SAC and DDPG) on these tasks, even in the absence of demonstrations. Furthermore, integrating into our method two improvements from previous works allows our approach to outperform all baselines.
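    A toy sketch of the relabelling rule as the abstract states it (the plain-list buffer and tuple layout are our assumptions) may make the mechanism concrete:

        # Reward bonus on demonstrations and on successful episodes.
        BONUS = 1.0

        def add_demo(buffer, demo_transitions):
            for (s, a, r, s2, done) in demo_transitions:
                buffer.append((s, a, r + BONUS, s2, done))   # bonus on demonstrations

        def add_episode(buffer, episode, success):
            bonus = BONUS if success else 0.0                # self-imitation on success
            for (s, a, r, s2, done) in episode:
                buffer.append((s, a, r + bonus, s2, done))

        buffer = []
        add_demo(buffer, [("s0", 0, 0.0, "s1", False)])
        add_episode(buffer, [("s0", 1, 0.0, "s1", True)], success=True)
        print(buffer)  # relabelled rewards feed any off-policy learner (SAC, DDPG, ...)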
    DDG-DA: Data Distribution Generation for Predictable Concept Drift Adaptation. (arXiv:2201.04038v1 [cs.LG])
    (2 min) In many real-world scenarios, we often deal with streaming data that is sequentially collected over time. Due to the non-stationary nature of the environment, the streaming data distribution may change in unpredictable ways, which is known as concept drift. To handle concept drift, previous methods first detect when/where the concept drift happens and then adapt models to fit the distribution of the latest data. However, there are still many cases in which some underlying factors of environment evolution are predictable, making it possible to model the future concept drift trend of the streaming data; such cases have not been fully explored in previous work. In this paper, we propose a novel method, DDG-DA, that can effectively forecast the evolution of data distribution and improve the performance of models. Specifically, we first train a predictor to estimate the future data distribution, then leverage it to generate training samples, and finally train models on the generated data. We conduct experiments on three real-world tasks (forecasting stock price trends, electricity load and solar irradiance) and obtain significant improvements on multiple widely-used models.
    State Estimation in Electric Power Systems Leveraging Graph Neural Networks. (arXiv:2201.04056v1 [cs.LG])
    (2 min) The goal of the state estimation (SE) algorithm is to estimate complex bus voltages as state variables based on the available set of measurements in the power system. Because phasor measurement units (PMUs) are increasingly being used in transmission power systems, there is a need for a fast SE solver that can take advantage of PMU high sampling rates. This paper proposes training a graph neural network (GNN) to learn the estimates given the PMU voltage and current measurements as inputs, with the intent of obtaining fast and accurate predictions during the evaluation phase. GNN is trained using synthetic datasets, created by randomly sampling sets of measurements in the power system and labelling them with a solution obtained using a linear SE with PMUs solver. The presented results display the accuracy of GNN predictions in various test scenarios and tackle the sensitivity of the predictions to the missing input data.
    An Introduction to Autoencoders. (arXiv:2201.03898v1 [cs.LG])
    (2 min) In this article, we will look at autoencoders. This article covers the mathematics and the fundamental concepts of autoencoders. We will discuss what they are, what the limitations are, the typical use cases, and we will look at some examples. We will start with a general introduction to autoencoders, and we will discuss the role of the activation function in the output layer and the loss function. We will then discuss what the reconstruction error is. Finally, we will look at typical applications such as dimensionality reduction, classification, denoising, and anomaly detection. This paper contains the notes of a PhD-level lecture on autoencoders given in 2021.
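    In that spirit, a minimal undercomplete autoencoder (our sketch, not the article's code) fits in a few lines of PyTorch; the reconstruction error it trains on is the same quantity the listed applications reuse:

        # Minimal undercomplete autoencoder trained on reconstruction error.
        import torch
        import torch.nn as nn

        class Autoencoder(nn.Module):
            def __init__(self, d_in=784, d_latent=32):
                super().__init__()
                self.encoder = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(),
                                             nn.Linear(128, d_latent))
                self.decoder = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(),
                                             nn.Linear(128, d_in), nn.Sigmoid())

            def forward(self, x):
                return self.decoder(self.encoder(x))

        model = Autoencoder()
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        x = torch.rand(64, 784)                      # stand-in for flattened images
        for _ in range(5):
            recon = model(x)
            loss = nn.functional.mse_loss(recon, x)  # the reconstruction error
            opt.zero_grad()
            loss.backward()
            opt.step()
        print(loss.item())  # high per-sample error later flags anomalies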
    Verified Probabilistic Policies for Deep Reinforcement Learning. (arXiv:2201.03698v1 [cs.AI])
    (2 min) Deep reinforcement learning is an increasingly popular technique for synthesising policies to control an agent's interaction with its environment. There is also growing interest in formally verifying that such policies are correct and execute safely. Progress has been made in this area by building on existing work for verification of deep neural networks and of continuous-state dynamical systems. In this paper, we tackle the problem of verifying probabilistic policies for deep reinforcement learning, which are used to, for example, tackle adversarial environments, break symmetries and manage trade-offs. We propose an abstraction approach, based on interval Markov decision processes, that yields probabilistic guarantees on a policy's execution, and present techniques to build and solve these models using abstract interpretation, mixed-integer linear programming, entropy-based refinement and probabilistic model checking. We implement our approach and illustrate its effectiveness on a selection of reinforcement learning benchmarks.
    Atomistic Simulations for Reactions and Spectroscopy in the Era of Machine Learning -- Quo Vadis?. (arXiv:2201.03822v1 [physics.chem-ph])
    (2 min) Atomistic simulations using accurate energy functions can provide molecular-level insight into functional motions of molecules in the gas and condensed phases. Recently developed and currently pursued efforts to integrate and combine such simulations with machine learning techniques provide a unique opportunity to bring these dynamics simulations closer to reality. This perspective delineates the present status of the field, drawing on the efforts of others as well as some of our own work, and discusses open questions and future prospects.
    A Likelihood Ratio based Domain Adaptation Method for E2E Models. (arXiv:2201.03655v1 [cs.CL])
    (2 min) End-to-end (E2E) automatic speech recognition models like Recurrent Neural Networks Transducer (RNN-T) are becoming a popular choice for streaming ASR applications like voice assistants. While E2E models are very effective at learning representation of the training data they are trained on, their accuracy on unseen domains remains a challenging problem. Additionally, these models require paired audio and text training data, are computationally expensive and are difficult to adapt towards the fast evolving nature of conversational speech. In this work, we explore a contextual biasing approach using likelihood-ratio that leverages text data sources to adapt RNN-T model to new domains and entities. We show that this method is effective in improving rare words recognition, and results in a relative improvement of 10% in 1-best word error rate (WER) and 10% in n-best Oracle WER (n=8) on multiple out-of-domain datasets without any degradation on a general dataset. We also show that complementing the contextual biasing adaptation with adaptation of a second-pass rescoring model gives additive WER improvements.
    Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices. (arXiv:2103.03239v4 [cs.LG] UPDATED)
    (2 min) Training deep neural networks on large datasets can often be accelerated by using multiple compute nodes. This approach, known as distributed training, can utilize hundreds of computers via specialized message-passing protocols such as Ring All-Reduce. However, running these protocols at scale requires reliable high-speed networking that is only available in dedicated clusters. In contrast, many real-world applications, such as federated learning and cloud-based distributed training, operate on unreliable devices with unstable network bandwidth. As a result, these applications are restricted to using parameter servers or gossip-based averaging protocols. In this work, we lift that restriction by proposing Moshpit All-Reduce - an iterative averaging protocol that exponentially converges to the global average. We demonstrate the efficiency of our protocol for distributed optimization with strong theoretical guarantees. The experiments show 1.3x speedup for ResNet-50 training on ImageNet compared to competitive gossip-based strategies and 1.5x speedup when training ALBERT-large from scratch using preemptible compute nodes.
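    The averaging principle is easy to simulate. The numpy sketch below shows iterative random-group averaging driving all workers to the global mean (an illustration of the idea only; the actual fault-tolerant networking protocol is far more involved):

        # Iterative random-group averaging converges to the global mean.
        import numpy as np

        rng = np.random.default_rng(0)
        values = rng.standard_normal(16)      # one scalar "parameter" per worker
        target = values.mean()

        for round_ in range(6):
            perm = rng.permutation(len(values))           # random regrouping each round
            for group in perm.reshape(4, 4):              # 4 groups of 4 workers
                values[group] = values[group].mean()      # in-group all-reduce
            print(round_, np.abs(values - target).max())  # worst-case gap shrinks fast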
    Optimal and Differentially Private Data Acquisition: Central and Local Mechanisms. (arXiv:2201.03968v1 [cs.GT])
    (2 min) We consider a platform's problem of collecting data from privacy sensitive users to estimate an underlying parameter of interest. We formulate this question as a Bayesian-optimal mechanism design problem, in which an individual can share her (verifiable) data in exchange for a monetary reward or services, but at the same time has a (private) heterogeneous privacy cost which we quantify using differential privacy. We consider two popular differential privacy settings for providing privacy guarantees for the users: central and local. In both settings, we establish minimax lower bounds for the estimation error and derive (near) optimal estimators for given heterogeneous privacy loss levels for users. Building on this characterization, we pose the mechanism design problem as the optimal selection of an estimator and payments that will elicit truthful reporting of users' privacy sensitivities. Under a regularity condition on the distribution of privacy sensitivities we develop efficient algorithmic mechanisms to solve this problem in both privacy settings. Our mechanism in the central setting can be implemented in time $\mathcal{O}(n \log n)$ where $n$ is the number of users and our mechanism in the local setting admits a Polynomial Time Approximation Scheme (PTAS).
    FairEdit: Preserving Fairness in Graph Neural Networks through Greedy Graph Editing. (arXiv:2201.03681v1 [cs.LG])
    (2 min) Graph Neural Networks (GNNs) have proven to excel in predictive modeling tasks where the underlying data is a graph. However, as GNNs are extensively used in human-centered applications, the issue of fairness has arisen. While edge deletion is a common method used to promote fairness in GNNs, it fails to consider when data is inherently missing fair connections. In this work we consider the unexplored method of edge addition, accompanied by deletion, to promote fairness. We propose two model-agnostic algorithms to perform edge editing: a brute force approach and a continuous approximation approach, FairEdit. FairEdit performs efficient edge editing by leveraging gradient information of a fairness loss to find edges that improve fairness. We find that FairEdit outperforms standard training for many data sets and GNN methods, while performing comparably to many state-of-the-art methods, demonstrating FairEdit's ability to improve fairness across many domains and models.
    Optimally compressing VC classes. (arXiv:2201.04131v1 [cs.LG])
    (1 min) Resolving a conjecture of Littlestone and Warmuth, we show that any concept class of VC-dimension $d$ has a sample compression scheme of size $d$.
    The Dataset Nutrition Label (2nd Gen): Leveraging Context to Mitigate Harms in Artificial Intelligence. (arXiv:2201.03954v1 [cs.LG])
    (2 min) As the production of and reliance on datasets to produce automated decision-making systems (ADS) increases, so does the need for processes for evaluating and interrogating the underlying data. After launching the Dataset Nutrition Label in 2018, the Data Nutrition Project has made significant updates to the design and purpose of the Label, and is launching an updated Label in late 2020, which is previewed in this paper. The new Label includes context-specific Use Cases & Alerts presented through an updated design and user interface targeted towards the data scientist profile. This paper discusses the harm and bias from underlying training data that the Label is intended to mitigate, the current state of the work including new datasets being labeled, new and existing challenges, and further directions of the work, as well as Figures previewing the new label.
    VGAER: graph neural network reconstruction based community detection. (arXiv:2201.04066v1 [cs.SI])
    (2 min) Community detection is a fundamental and important issue in network science, but there are only a few community detection algorithms based on graph neural networks, and unsupervised ones among them are almost nonexistent. By fusing high-order modularity information with network features, this paper proposes VGAER, the first Variational Graph AutoEncoder Reconstruction based community detection method, and gives its non-probabilistic version. Neither requires any prior information. We have carefully designed corresponding input features, decoder, and downstream tasks based on the community detection task, and these designs are concise, natural, and perform well (NMI values under our design are improved by 59.1% - 565.9%). Based on a series of experiments with a wide range of datasets and advanced methods, VGAER has achieved superior performance and shows strong competitiveness and potential with a simpler design. Finally, we report the results of algorithm convergence analysis and t-SNE visualization, which clearly depict the stable performance and powerful network modularity ability of VGAER. Our codes are available at https://github.com/qcydm/VGAER.
    DANNTe: a case study of a turbo-machinery sensor virtualization under domain shift. (arXiv:2201.03850v1 [cs.LG])
    (2 min) We propose an adversarial learning method to tackle a Domain Adaptation (DA) time series regression task (DANNTe). The regression aims at building a virtual copy of a sensor installed on a gas turbine, to be used in place of the physical sensor which can be missing in certain situations. Our DA approach is to search for a domain-invariant representation of the features. The learner has access to both a labelled source dataset and an unlabeled target dataset (unsupervised DA) and is trained on both, exploiting the minmax game between a task regressor and a domain classifier Neural Networks. Both models share the same feature representation, learnt by a feature extractor. This work is based on the results published by Ganin et al. arXiv:1505.07818; indeed, we present an extension suitable to time series applications. We report a significant improvement in regression performance, compared to the baseline model trained on the source domain only.
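    The core building block of the Ganin et al. approach is the gradient reversal layer; the sketch below shows the generic mechanism (not the turbo-machinery model): features pass through unchanged on the forward pass, while the domain classifier's gradient is negated on the way back, pushing the feature extractor towards domain-invariant representations:

        # Gradient reversal layer, the heart of domain-adversarial training.
        import torch

        class GradReverse(torch.autograd.Function):
            @staticmethod
            def forward(ctx, x, lam):
                ctx.lam = lam
                return x.view_as(x)               # identity on the forward pass

            @staticmethod
            def backward(ctx, grad_output):
                return -ctx.lam * grad_output, None   # reversed gradient, none for lam

        feat = torch.nn.Linear(10, 16)                # shared feature extractor
        dom_clf = torch.nn.Linear(16, 2)              # domain classifier
        x = torch.randn(4, 10)
        z = feat(x)
        dom_logits = dom_clf(GradReverse.apply(z, 1.0))
        loss = torch.nn.functional.cross_entropy(dom_logits, torch.tensor([0, 0, 1, 1]))
        loss.backward()                               # feat's gradients are sign-flipped
        print(feat.weight.grad.shape)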
    Winning solutions and post-challenge analyses of the ChaLearn AutoDL challenge 2019. (arXiv:2201.03801v1 [cs.LG])
    (2 min) This paper reports the results and post-challenge analyses of ChaLearn's AutoDL challenge series, which helped sort out a profusion of AutoML solutions for Deep Learning (DL) that had been introduced in a variety of settings, but lacked fair comparisons. All input data modalities (time series, images, videos, text, tabular) were formatted as tensors and all tasks were multi-label classification problems. Code submissions were executed on hidden tasks, with limited time and computational resources, pushing solutions that get results quickly. In this setting, DL methods dominated, though popular Neural Architecture Search (NAS) was impractical. Solutions relied on fine-tuned pre-trained networks, with architectures matching data modality. Post-challenge tests did not reveal improvements beyond the imposed time limit. While no component is particularly original or novel, a high-level modular organization emerged featuring a "meta-learner", "data ingestor", "model selector", "model/learner", and "evaluator". This modularity enabled ablation studies, which revealed the importance of (off-platform) meta-learning, ensembling, and efficient data management. Experiments on heterogeneous module combinations further confirm the (local) optimality of the winning solutions. Our challenge legacy includes an ever-lasting benchmark (this http URL), the open-sourced code of the winners, and a free "AutoDL self-service".
    Online Changepoint Detection on a Budget. (arXiv:2201.03710v1 [cs.LG])
    (2 min) Changepoints are abrupt variations in the underlying distribution of data. Detecting changes in a data stream is an important problem with many applications. In this paper, we are interested in changepoint detection algorithms which operate in an online setting in the sense that both its storage requirements and worst-case computational complexity per observation are independent of the number of previous observations. We propose an online changepoint detection algorithm for both univariate and multivariate data which compares favorably with offline changepoint detection algorithms while also operating in a strictly more constrained computational model. In addition, we present a simple online hyperparameter auto tuning technique for these algorithms.
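    As a point of reference for the computational model, here is a classic constant-memory online detector (a one-sided CUSUM, not the algorithm proposed in the paper); both its state and its per-observation cost are O(1), independent of the number of previous observations:

        # One-sided CUSUM changepoint detector with O(1) state per observation.
        class OnlineCusum:
            def __init__(self, drift=0.05, threshold=5.0):
                self.drift, self.threshold = drift, threshold
                self.n, self.mean, self.g = 0, 0.0, 0.0

            def update(self, x):
                self.n += 1
                self.mean += (x - self.mean) / self.n        # running mean, O(1) memory
                self.g = max(0.0, self.g + x - self.mean - self.drift)
                return self.g > self.threshold               # True => changepoint alarm

        det = OnlineCusum()
        stream = [0.0] * 50 + [1.5] * 20                     # mean shift at t = 50
        alarms = [t for t, x in enumerate(stream) if det.update(x)]
        print(alarms[:1])  # first alarm shortly after the shift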
    Feature Extraction Framework based on Contrastive Learning with Adaptive Positive and Negative Samples. (arXiv:2201.03942v1 [cs.LG])
    (2 min) In this study, we propose a feature extraction framework based on contrastive learning with adaptive positive and negative samples (CL-FEFA) that is suitable for unsupervised, supervised, and semi-supervised single-view feature extraction. CL-FEFA adaptively constructs the positive and negative samples from the results of feature extraction itself, which makes it more appropriate and accurate. Thereafter, the discriminative features are re-extracted according to the InfoNCE loss based on the previous positive and negative samples, which makes the intra-class samples more compact and the inter-class samples more dispersed. At the same time, using the potential structure information of subspace samples to dynamically construct positive and negative samples makes our framework more robust to noisy data. Furthermore, CL-FEFA considers the mutual information between positive samples, that is, similar samples in potential structures, which provides theoretical support for its advantages in feature extraction. The final numerical experiments prove that the proposed framework has a strong advantage over traditional feature extraction methods and contrastive learning methods.
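    For context, the InfoNCE objective the framework builds on can be sketched in a few lines of PyTorch (generic paired-batch form; CL-FEFA's adaptive construction of positives and negatives is not shown):

        # Generic InfoNCE: each anchor must score its positive above all negatives.
        import torch
        import torch.nn.functional as F

        def info_nce(anchors, positives, temperature=0.1):
            """anchors, positives: (n, d) feature batches, row i is a positive pair."""
            a = F.normalize(anchors, dim=1)
            p = F.normalize(positives, dim=1)
            logits = a @ p.T / temperature          # (n, n) similarity matrix
            labels = torch.arange(len(a))           # diagonal entries are the positives
            return F.cross_entropy(logits, labels)

        torch.manual_seed(0)
        z = torch.randn(8, 32)
        loss = info_nce(z, z + 0.1 * torch.randn(8, 32))
        print(loss.item())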
    PEPit: computer-assisted worst-case analyses of first-order optimization methods in Python. (arXiv:2201.04040v1 [math.OC])
    (2 min) PEPit is a Python package aiming at simplifying the access to worst-case analyses of a large family of first-order optimization methods possibly involving gradient, projection, proximal, or linear optimization oracles, along with their approximate, or Bregman variants. In short, PEPit is a package enabling computer-assisted worst-case analyses of first-order optimization methods. The key underlying idea is to cast the problem of performing a worst-case analysis, often referred to as a performance estimation problem (PEP), as a semidefinite program (SDP) which can be solved numerically. To do that, package users only need to write first-order methods nearly as they would have implemented them. The package then takes care of the SDP modelling parts, and the worst-case analysis is performed numerically via a standard solver.
    Entropic Optimal Transport in Random Graphs. (arXiv:2201.03949v1 [stat.ML])
    (2 min) In graph analysis, a classic task consists in computing similarity measures between (groups of) nodes. In latent space random graphs, nodes are associated to unknown latent variables. One may then seek to compute distances directly in the latent space, using only the graph structure. In this paper, we show that it is possible to consistently estimate entropic-regularized Optimal Transport (OT) distances between groups of nodes in the latent space. We provide a general stability result for entropic OT with respect to perturbations of the cost matrix. We then apply it to several examples of random graphs, such as graphons or $\epsilon$-graphs on manifolds. Along the way, we prove new concentration results for the so-called Universal Singular Value Thresholding estimator, and for the estimation of geodesic distances on a manifold.
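    The entropic OT distances in question are typically computed with Sinkhorn iterations; the numpy sketch below shows the standard estimator on a toy cost matrix (estimating latent distances from an observed graph is the paper's contribution and is not reproduced here):

        # Entropic-regularized OT via Sinkhorn iterations on a toy cost matrix.
        import numpy as np

        def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
            """cost: (n, m) matrix; a, b: marginal weights summing to 1."""
            K = np.exp(-cost / eps)                  # Gibbs kernel
            u, v = np.ones_like(a), np.ones_like(b)
            for _ in range(n_iters):
                u = a / (K @ v)                      # alternate marginal scaling
                v = b / (K.T @ u)
            plan = u[:, None] * K * v[None, :]
            return np.sum(plan * cost)               # transport cost under the entropic plan

        rng = np.random.default_rng(0)
        x, y = rng.standard_normal((5, 2)), rng.standard_normal((7, 2))
        cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared distances
        print(sinkhorn(cost, np.full(5, 0.2), np.full(7, 1 / 7)))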
    Partial Model Averaging in Federated Learning: Performance Guarantees and Benefits. (arXiv:2201.03789v1 [cs.LG])
    (2 min) Local Stochastic Gradient Descent (SGD) with periodic model averaging (FedAvg) is a foundational algorithm in Federated Learning. The algorithm independently runs SGD on multiple workers and periodically averages the model across all the workers. When local SGD runs with many workers, however, the periodic averaging causes a significant model discrepancy across the workers, making the global loss converge slowly. While recent advanced optimization methods tackle the issue with a focus on non-IID settings, the model discrepancy issue caused by the underlying periodic model averaging still exists. We propose a partial model averaging framework that mitigates the model discrepancy issue in Federated Learning. The partial averaging encourages the local models to stay close to each other in parameter space, and it enables the global loss to be minimized more effectively. Given a fixed number of iterations and a large number of workers (128), the partial averaging achieves up to 2.2% higher validation accuracy than the periodic full averaging.
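    The following numpy toy (our reading of the abstract; the paper's actual partitioning schedule is an assumption here) illustrates partial averaging: each synchronization round averages only one slice of the parameters, cycling through the model over rounds:

        # Partial model averaging: sync one parameter slice per round.
        import numpy as np

        rng = np.random.default_rng(0)
        n_workers, dim, n_parts = 4, 12, 3
        models = rng.standard_normal((n_workers, dim))   # divergent local models

        def partial_average(models, round_idx):
            part = round_idx % n_parts                   # which slice to sync this round
            lo, hi = part * dim // n_parts, (part + 1) * dim // n_parts
            models[:, lo:hi] = models[:, lo:hi].mean(axis=0)
            return models

        for r in range(3):                               # after n_parts rounds, every
            models = partial_average(models, r)          # coordinate has been averaged
        print(np.ptp(models, axis=0).max())              # 0.0 in this no-local-update toy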
    Classification of Beer Bottles using Object Detection and Transfer Learning. (arXiv:2201.03791v1 [cs.CV])
    (2 min) Classification problems are common in Computer Vision. Despite this, there is no dedicated work for the classification of beer bottles. As part of the challenge of the master course Deep Learning, a dataset of 5207 beer bottle images and brand labels was created. An image contains exactly one beer bottle. In this paper we present a deep learning model which classifies pictures of beer bottles in a two-step approach. As the first step, a Faster R-CNN detects image sections relevant for classification independently of the brand. In the second step, the relevant image sections are classified by a ResNet-18. The image section with the highest confidence is returned as class label. We propose a model with which we surpass the classic one-step transfer learning approach, reaching an accuracy of 99.86% during the challenge on the final test dataset. We were able to achieve 100% accuracy after the challenge ended.
    Turkish Sentiment Analysis Using Machine Learning Methods: Application on Online Food Order Site Reviews. (arXiv:2201.03848v1 [cs.CL])
    (2 min) Satisfaction measurement, which emerges in every sector today, is a very important factor for many companies. In this study, we aim to reach the highest possible accuracy with various machine learning algorithms using the data from Yemek Sepeti and variations of this data. The accuracy values of each algorithm were calculated together with the various natural language processing methods used. While calculating these accuracy values, we tried to optimize the parameters of the algorithms used. The models trained in this study on labeled data can be used on unlabeled data and can give companies an idea in measuring customer satisfaction. It was observed that the 3 different natural language processing methods applied resulted in approximately 5% accuracy increase in most of the developed models.
    Competing Mutual Information Constraints with Stochastic Competition-based Activations for Learning Diversified Representations. (arXiv:2201.03624v1 [cs.LG])
    (2 min) This work aims to address the long-established problem of learning diversified representations. To this end, we combine information-theoretic arguments with stochastic competition-based activations, namely Stochastic Local Winner-Takes-All (LWTA) units. In this context, we ditch the conventional deep architectures commonly used in Representation Learning, that rely on non-linear activations; instead, we replace them with sets of locally and stochastically competing linear units. In this setting, each network layer yields sparse outputs, determined by the outcome of the competition between units that are organized into blocks of competitors. We adopt stochastic arguments for the competition mechanism, which perform posterior sampling to determine the winner of each block. We further endow the considered networks with the ability to infer the sub-part of the network that is essential for modeling the data at hand; we impose appropriate stick-breaking priors to this end. To further enrich the information of the emerging representations, we resort to information-theoretic principles, namely the Information Competing Process (ICP). Then, all the components are tied together under the stochastic Variational Bayes framework for inference. We perform a thorough experimental investigation for our approach using benchmark datasets on image classification. As we experimentally show, the resulting networks yield significant discriminative representation learning abilities. In addition, the introduced paradigm allows for a principled investigation mechanism of the emerging intermediate network representations.
    Multi-granularity Relabeled Under-sampling Algorithm for Imbalanced Data. (arXiv:2201.03957v1 [cs.LG])
    (2 min) The imbalanced classification problem turns out to be one of the important and challenging problems in data mining and machine learning. The performance of traditional classifiers is severely affected by many data problems, such as class imbalance, class overlap and noise. When it was first proposed, the Tomek-Link algorithm was only used to clean data. In recent years, there have been reports of combining the Tomek-Link algorithm with sampling techniques. The Tomek-Link sampling algorithm can effectively reduce class overlap in the data, remove majority instances that are difficult to distinguish, and improve algorithm classification accuracy. However, the Tomek-Link under-sampling algorithm only considers boundary instances that are each other's nearest neighbors globally and ignores potential local overlapping instances. When the number of minority instances is small, the under-sampling effect is not satisfactory, and the performance improvement of the classification model is not obvious. Therefore, on the basis of Tomek-Link, a multi-granularity relabeled under-sampling algorithm (MGRU) is proposed. This algorithm fully considers the local information of the data set in local granularity subspaces and detects local potentially overlapping instances in the data set. Then, the overlapping majority instances are eliminated according to the global relabeled index value, which effectively expands the detection range of Tomek-Links. The simulation results show that when we select the optimal global relabeled index value for under-sampling, the classification accuracy and generalization performance of the proposed under-sampling algorithm are significantly better than those of other baseline algorithms.
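    For reference, classic global Tomek-Link under-sampling, which MGRU extends, fits in a few lines with scikit-learn (the multi-granularity relabelling itself is not shown): a majority point is dropped when it forms a mutual-nearest-neighbour pair with a minority point.

        # Global Tomek-Link under-sampling of the majority class.
        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        def tomek_link_undersample(X, y, majority=0):
            nn = NearestNeighbors(n_neighbors=2).fit(X)
            nearest = nn.kneighbors(X, return_distance=False)[:, 1]  # skip self
            drop = set()
            for i, j in enumerate(nearest):
                if nearest[j] == i and y[i] != y[j]:     # mutual neighbours, mixed labels
                    drop.add(i if y[i] == majority else j)
            keep = np.array([i for i in range(len(y)) if i not in drop])
            return X[keep], y[keep]

        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(1, 1, (10, 2))])
        y = np.array([0] * 40 + [1] * 10)
        Xr, yr = tomek_link_undersample(X, y)
        print(len(y), "->", len(yr))  # boundary majority points removed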
    Captcha Attack: Turning Captchas Against Humanity. (arXiv:2201.04014v1 [cs.CR])
    (2 min) Nowadays, people generate and share massive content on online platforms (e.g., social networks, blogs). In 2021, the 1.9 billion daily active Facebook users posted around 150 thousand photos every minute. Content moderators constantly monitor these online platforms to prevent the spreading of inappropriate content (e.g., hate speech, nudity images). Based on deep learning (DL) advances, Automatic Content Moderators (ACM) help human moderators handle high data volume. Despite their advantages, attackers can exploit weaknesses of DL components (e.g., preprocessing, model) to affect their performance. Therefore, an attacker can leverage such techniques to spread inappropriate content by evading ACM. In this work, we propose CAPtcha Attack (CAPA), an adversarial technique that allows users to spread inappropriate text online by evading ACM controls. CAPA, by generating custom textual CAPTCHAs, exploits ACM's careless design implementations and internal procedures vulnerabilities. We test our attack on real-world ACM, and the results confirm the ferocity of our simple yet effective attack, reaching up to a 100% evasion success in most cases. At the same time, we demonstrate the difficulties in designing CAPA mitigations, opening new challenges in CAPTCHAs research area.
    A multi-scale sampling method for accurate and robust deep neural network to predict combustion chemical kinetics. (arXiv:2201.03549v1 [physics.chem-ph])
    (2 min) Machine learning has long been considered as a black box for predicting combustion chemical kinetics due to the extremely large number of parameters and the lack of evaluation standards and reproducibility. The current work aims to understand two basic questions regarding the deep neural network (DNN) method: what data the DNN needs and how general the DNN method can be. Sampling and preprocessing determine the DNN training dataset and thus affect the DNN's prediction ability. The current work proposes using the Box-Cox transformation (BCT) to preprocess the combustion data. In addition, this work compares different sampling methods with or without preprocessing, including the Monte Carlo method, manifold sampling, a generative neural network method (cycle-GAN), and a newly-proposed multi-scale sampling. Our results reveal that a DNN trained on the manifold data can capture the chemical kinetics in limited configurations but cannot remain robust toward perturbation, which is inevitable for a DNN coupled with the flow field. The Monte Carlo and cycle-GAN samplings can cover a wider phase space but fail to capture small-scale intermediate species, producing poor prediction results. A three-hidden-layer DNN, based on the multi-scale method without specific flame simulation data, allows predicting chemical kinetics in various scenarios and remains stable during the temporal evolutions. This single DNN is readily implemented with several CFD codes and validated in various combustors, including (1) zero-dimensional autoignition, (2) one-dimensional freely propagating flame, (3) two-dimensional jet flame with triple-flame structure, and (4) three-dimensional turbulent lifted flames. The results demonstrate the satisfying accuracy and generalization ability of the pre-trained DNN. The Fortran and Python versions of the DNN and example code are attached in the supplementary material for reproducibility.
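    The Box-Cox preprocessing step is standard and available in scipy; the sketch below (synthetic mass fractions, our assumption) shows why it helps, compressing data spanning many orders of magnitude into a scale a DNN can fit:

        # Box-Cox transformation of data spanning many decades.
        import numpy as np
        from scipy.stats import boxcox

        rng = np.random.default_rng(0)
        mass_fraction = 10 ** rng.uniform(-8, 0, size=1000)   # spans 8 decades

        transformed, lam = boxcox(mass_fraction)              # lambda fitted by MLE
        print(f"lambda = {lam:.3f}")
        print("raw range:", mass_fraction.min(), mass_fraction.max())
        print("BCT range:", transformed.min(), transformed.max())
        # Invert after prediction: scipy.special.inv_boxcox(transformed, lam)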
    Learning Fair Node Representations with Graph Counterfactual Fairness. (arXiv:2201.03662v1 [cs.LG])
    (2 min) Fair machine learning aims to mitigate the biases of model predictions against certain subpopulations regarding sensitive attributes such as race and gender. Among the many existing fairness notions, counterfactual fairness measures model fairness from a causal perspective by comparing the predictions for each individual from the original data and from counterfactuals in which the sensitive attribute values of this individual have been modified. Recently, a few works extend counterfactual fairness to graph data, but most of them neglect the following facts that can lead to biases: 1) the sensitive attributes of each node's neighbors may causally affect the prediction w.r.t. this node; 2) the sensitive attributes may causally affect other features and the graph structure. To tackle these issues, in this paper, we propose a novel fairness notion - graph counterfactual fairness, which considers the biases caused by the above facts. To learn node representations towards graph counterfactual fairness, we propose a novel framework based on counterfactual data augmentation. In this framework, we generate counterfactuals corresponding to perturbations on each node's and their neighbors' sensitive attributes. Then we enforce fairness by minimizing the discrepancy between the representations learned from the original graph and the counterfactuals for each node. Experiments on both synthetic and real-world graphs show that our framework outperforms the state-of-the-art baselines in graph counterfactual fairness, and also achieves comparable prediction performance.
    Towards Group Robustness in the presence of Partial Group Labels. (arXiv:2201.03668v1 [cs.LG])
    (2 min) Learning invariant representations is an important requirement when training machine learning models that are driven by spurious correlations in the datasets. These spurious correlations, between input samples and the target labels, wrongly direct the neural network predictions resulting in poor performance on certain groups, especially the minority groups. Robust training against these spurious correlations requires the knowledge of group membership for every sample. Such a requirement is impractical in situations where the data labeling efforts for minority or rare groups are significantly laborious or where the individuals comprising the dataset choose to conceal sensitive information. On the other hand, the presence of such data collection efforts results in datasets that contain partially labeled group information. Recent works have tackled the fully unsupervised scenario where no labels for groups are available. Thus, we aim to fill the missing gap in the literature by tackling a more realistic setting that can leverage partially available sensitive or group information during training. First, we construct a constraint set and derive a high probability bound for the group assignment to belong to the set. Second, we propose an algorithm that optimizes for the worst-off group assignments from the constraint set. Through experiments on image and tabular datasets, we show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
    SpectraNet: Learned Recognition of Artificial Satellites From High Contrast Spectroscopic Imagery. (arXiv:2201.03614v1 [cs.LG])
    (2 min) Effective space traffic management requires positive identification of artificial satellites. Current methods for extracting object identification from observed data require spatially resolved imagery which limits identification to objects in low earth orbits. Most artificial satellites, however, operate in geostationary orbits at distances which prohibit ground based observatories from resolving spatial information. This paper demonstrates an object identification solution leveraging modified residual convolutional neural networks to map distance-invariant spectroscopic data to object identity. We report classification accuracies exceeding 80% for a simulated 64-class satellite problem--even in the case of satellites undergoing constant, random re-orientation. An astronomical observing campaign driven by these results returned accuracies of 72% for a nine-class problem with an average of 100 examples per class, performing as expected from simulation. We demonstrate the application of variational Bayesian inference by dropout, stochastic weight averaging (SWA), and SWA-focused deep ensembling to measure classification uncertainties--critical components in space traffic management where routine decisions risk expensive space assets and carry geopolitical consequences.
    Learning what to remember. (arXiv:2201.03806v1 [cs.LG])
    (2 min) We consider a lifelong learning scenario in which a learner faces a never-ending and arbitrary stream of facts and has to decide which ones to retain in its limited memory. We introduce a mathematical model based on the online learning framework, in which the learner measures itself against a collection of experts that are also memory-constrained and that reflect different policies for what to remember. Interspersed with the stream of facts are occasional questions, and on each of these the learner incurs a loss if it has not remembered the corresponding fact. Its goal is to do almost as well as the best expert in hindsight, while using roughly the same amount of memory. We identify difficulties with using the multiplicative weights update algorithm in this memory-constrained scenario, and design an alternative scheme whose regret guarantees are close to the best possible.
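    For background, the standard multiplicative weights update algorithm that the paper identifies difficulties with looks as follows (toy 0/1 losses; the memory-constrained expert policies themselves are not modelled here):

        # Multiplicative weights update over a pool of experts.
        import numpy as np

        def mwu(losses, eta=0.5):
            """losses: (rounds, n_experts) 0/1 losses. Returns learner's total loss."""
            n = losses.shape[1]
            w = np.ones(n)
            total = 0.0
            for step in losses:
                p = w / w.sum()                    # probability of following each expert
                total += p @ step                  # expected loss this round
                w *= np.exp(-eta * step)           # downweight experts that erred
            return total

        rng = np.random.default_rng(0)
        losses = rng.integers(0, 2, size=(100, 5)).astype(float)
        losses[:, 2] = 0.0                         # one expert never forgets the answer
        print(mwu(losses), losses.sum(axis=0).min())  # learner tracks the best expert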
    Reproducing BowNet: Learning Representations by Predicting Bags of Visual Words. (arXiv:2201.03556v1 [cs.CV])
    (2 min) This work aims to reproduce results from the CVPR 2020 paper by Gidaris et al. Self-supervised learning (SSL) is used to learn feature representations of an image using an unlabeled dataset. This work proposes to use bag-of-words (BoW) deep feature descriptors as a self-supervised learning target to learn robust, deep representations. BowNet is trained to reconstruct the histogram of visual words (i.e., the deep BoW descriptor) of a reference image when presented with a perturbed version of the image as input. Thus, this method aims to learn perturbation-invariant and context-aware image features that can be useful for few-shot tasks or supervised downstream tasks. In the paper, the author describes BowNet as a network consisting of a convolutional feature extractor $\Phi(\cdot)$ and a Dense-softmax layer $\Omega(\cdot)$ trained to predict BoW features from images. After BoW training, the features of $\Phi$ are used in downstream tasks. For this challenge we were trying to build and train a network that could reproduce the CIFAR-100 accuracy improvements reported in the original paper. However, we were unsuccessful in reproducing an accuracy improvement comparable to what the authors mentioned.
    Dictionary Learning with Uniform Sparse Representations for Anomaly Detection. (arXiv:2201.03869v1 [cs.LG])
    (2 min) Many applications like audio and image processing show that sparse representations are a powerful and efficient signal modeling technique. Finding an optimal dictionary that generates at the same time the sparsest representations of data and the smallest approximation error is a hard problem approached by dictionary learning (DL). We study how DL performs in detecting abnormal samples in a dataset of signals. In this paper we use a particular DL formulation that seeks a uniform sparse representation model to detect the underlying subspace of the majority of samples in a dataset, using a K-SVD-type algorithm. Numerical simulations show that one can efficiently use the resulting subspace to discriminate the anomalies from the regular data points.
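    The detection recipe can be sketched with scikit-learn's dictionary learner standing in for the paper's uniform-sparsity K-SVD variant (our substitution, noted plainly): samples that the learned dictionary reconstructs poorly are flagged as anomalies.

        # Dictionary-learning anomaly scoring via sparse-coding reconstruction error.
        import numpy as np
        from sklearn.decomposition import MiniBatchDictionaryLearning

        rng = np.random.default_rng(0)
        basis = rng.standard_normal((3, 20))
        normal = rng.standard_normal((200, 3)) @ basis        # 3-dim subspace signals
        anomalies = rng.standard_normal((5, 20)) * 2.0        # off-subspace samples
        X = np.vstack([normal, anomalies])

        dl = MiniBatchDictionaryLearning(n_components=8, transform_algorithm='omp',
                                         transform_n_nonzero_coefs=3, random_state=0)
        codes = dl.fit(normal).transform(X)                   # sparse codes, 3 atoms each
        errors = np.linalg.norm(X - codes @ dl.components_, axis=1)
        print(np.argsort(errors)[-5:])  # the off-subspace samples typically score highest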
    Active Reinforcement Learning -- A Roadmap Towards Curious Classifier Systems for Self-Adaptation. (arXiv:2201.03947v1 [cs.LG])
    (2 min) Intelligent systems have the ability to improve their behaviour over time taking observations, experiences or explicit feedback into account. Traditional approaches separate the learning problem and make isolated use of techniques from different fields of machine learning, such as reinforcement learning, active learning, anomaly detection or transfer learning. In this context, the fundamental reinforcement learning approaches come with several drawbacks that hinder their application to real-world systems: trial-and-error, purely reactive behaviour or isolated problem handling. The idea of this article is to present a concept for alleviating these drawbacks by setting up a research agenda towards what we call "active reinforcement learning" in intelligent systems.
    Iterative RAKI with Complex-Valued Convolution for Improved Image Reconstruction with Limited Scan-Specific Training Samples. (arXiv:2201.03560v1 [eess.IV])
    (2 min) MRI scan time reduction is commonly achieved by Parallel Imaging methods, typically based on uniform undersampling of the inverse image space (a.k.a. k-space) and simultaneous signal reception with multiple receiver coils. The GRAPPA method interpolates missing k-space signals by linear combination of adjacent, acquired signals across all coils, and can be described by a convolution in k-space. Recently, a more generalized method called RAKI was introduced. RAKI is a deep-learning method that generalizes GRAPPA with additional convolution layers, on which a non-linear activation function is applied. This enables non-linear estimation of missing signals by convolutional neural networks. In analogy to GRAPPA, the convolution kernels in RAKI are trained using scan-specific training samples obtained from auto-calibration-signals (ACS). RAKI provides superior reconstruction quality compared to GRAPPA; however, it often requires much more ACS data due to its increased number of unknown parameters. In order to overcome this limitation, this study investigates the influence of training data on the reconstruction quality for standard 2D imaging, with particular focus on its amount and contrast information. Furthermore, an iterative k-space interpolation approach (iRAKI) is evaluated, which includes training data augmentation via an initial GRAPPA reconstruction, and refinement of convolution filters by iterative training. Using only 18, 20 and 25 ACS lines (8%), iRAKI outperforms RAKI by suppressing residual artefacts occurring at acceleration factors R=4 and R=5, and yields strong noise suppression in comparison to GRAPPA, underlined by quantitative quality metrics. Combination with a phase-constraint yields further improvement. Additionally, iRAKI shows better performance than GRAPPA and RAKI in case of pre-scan calibration and strongly varying contrast between training and undersampled data.
    Evaluating Bayesian Model Visualisations. (arXiv:2201.03604v1 [cs.HC])
    (2 min) Probabilistic models inform an increasingly broad range of business and policy decisions ultimately made by people. Recent progress in algorithms, computation, and software frameworks has facilitated the proliferation of Bayesian probabilistic models, which characterise unobserved parameters by their joint distribution instead of point estimates. While these models can, in theory, empower decision makers to explore complex queries and to perform what-if-style conditioning, suitable visualisations and interactive tools are needed to maximise users' comprehension and rational decision making under uncertainty. In this paper, we propose a protocol for quantitative evaluation of Bayesian model visualisations and introduce a software framework implementing this protocol to support standardisation in evaluation practice and facilitate reproducibility. We illustrate the evaluation and analysis workflow with a user study that explores whether making Boxplots and Hypothetical Outcome Plots interactive can increase comprehension or rationality, and conclude with design guidelines for researchers looking to conduct similar studies in the future.
    Path differentiability of ODE flows. (arXiv:2201.03819v1 [cs.LG])
    (2 min) We consider flows of ordinary differential equations (ODEs) driven by path differentiable vector fields. Path differentiable functions constitute a proper subclass of Lipschitz functions which admit conservative gradients, a notion of generalized derivative compatible with basic calculus rules. Our main result states that such flows inherit the path differentiability property of the driving vector field. We show, indeed, that forward propagation of derivatives given by the sensitivity differential inclusions provides a conservative Jacobian for the flow. This allows us to propose a nonsmooth version of the adjoint method, which can be applied to integral costs under an ODE constraint. This result provides theoretical grounding for the application of small-step first-order methods to a broad class of nonsmooth optimization problems with parametrized ODE constraints, and is illustrated by the convergence of small-step first-order methods based on the proposed nonsmooth adjoint.
    Machine learning enabling high-throughput and remote operations at large-scale user facilities. (arXiv:2201.03550v1 [cs.LG])
    (2 min) Imaging, scattering, and spectroscopy are fundamental in understanding and discovering new functional materials. Contemporary innovations in automation and experimental techniques have led to these measurements being performed much faster and with higher resolution, thus producing vast amounts of data for analysis. These innovations are particularly pronounced at user facilities and synchrotron light sources. Machine learning (ML) methods are regularly developed to process and interpret large datasets in real time alongside measurements. However, there remain conceptual barriers to entry for the facility general user community, who often lack expertise in ML, as well as technical barriers to deploying ML models. Herein, we demonstrate a variety of archetypal ML models for on-the-fly analysis at multiple beamlines at the National Synchrotron Light Source II (NSLS-II). We describe these examples instructively, with a focus on integrating the models into existing experimental workflows, such that the reader can easily include their own ML techniques into experiments at NSLS-II or facilities with a common infrastructure. The framework presented here shows how, with little effort, diverse ML models can operate in conjunction with feedback loops via integration into the existing Bluesky Suite for experimental orchestration and data management.
    A Physics-Informed Vector Quantized Autoencoder for Data Compression of Turbulent Flow. (arXiv:2201.03617v1 [physics.flu-dyn])
    (2 min) Analyzing large-scale data from simulations of turbulent flows is memory intensive, requiring significant resources. This major challenge highlights the need for data compression techniques. In this study, we apply a physics-informed Deep Learning technique based on vector quantization to generate a discrete, low-dimensional representation of data from simulations of three-dimensional turbulent flows. The deep learning framework is composed of convolutional layers and incorporates physical constraints on the flow, such as preserving incompressibility and global statistical characteristics of the velocity gradients. The accuracy of the model is assessed using statistical, comparison-based similarity and physics-based metrics. The training data set is produced from Direct Numerical Simulation of an incompressible, statistically stationary, isotropic turbulent flow. The performance of this lossy data compression scheme is evaluated not only with unseen data from the stationary, isotropic turbulent flow, but also with data from decaying isotropic turbulence, and a Taylor-Green vortex flow. Defining the compression ratio (CR) as the ratio of original data size to the compressed one, the results show that our model based on vector quantization can offer CR $=85$ with a mean square error (MSE) of $O(10^{-3})$, and predictions that faithfully reproduce the statistics of the flow, except at the very smallest scales where there is some loss. Compared to the recent study based on a conventional autoencoder where compression is performed in a continuous space, our model improves the CR by more than $30$ percent, and reduces the MSE by an order of magnitude. Our compression model is an attractive solution for situations where fast, high quality and low-overhead encoding and decoding of large data are required.
    Stratified Graph Spectra. (arXiv:2201.03696v1 [cs.LG])
    (2 min) In classic graph signal processing, given a real-valued graph signal, its graph Fourier transform is typically defined as the series of inner products between the signal and each eigenvector of the graph Laplacian. Unfortunately, this definition is not mathematically valid in the case of vector-valued graph signals, which are, however, typical operands in state-of-the-art graph learning modeling and analyses. Seeking a generalized transformation that decodes the magnitudes of eigencomponents from vector-valued signals is thus the main objective of this paper. Several attempts are explored, and it is found that performing the transformation at hierarchical levels of adjacency helps profile the spectral characteristics of signals more insightfully. The proposed methods are introduced as a new tool for diagnosing and profiling the behavior of graph learning models.
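    The classic definition in the first sentence is compact enough to state directly; a minimal numpy sketch for a scalar-valued signal on a toy graph (the paper's vector-valued generalization is not shown):
    ```python
    # Graph Fourier transform: project the signal onto the eigenvectors
    # of the graph Laplacian.
    import numpy as np

    A = np.array([[0, 1, 0],                  # toy undirected adjacency matrix
                  [1, 0, 1],
                  [0, 1, 0]], dtype=float)
    L = np.diag(A.sum(axis=1)) - A            # combinatorial Laplacian L = D - A

    eigvals, eigvecs = np.linalg.eigh(L)      # L is symmetric, so eigh applies
    x = np.array([1.0, -2.0, 0.5])            # real-valued signal, one value per node

    x_hat = eigvecs.T @ x                     # inner products with each eigenvector
    x_rec = eigvecs @ x_hat                   # inverse transform recovers the signal
    assert np.allclose(x, x_rec)
    ```
    For a vector-valued signal of shape (n_nodes, d), the same projection yields a d-vector per eigencomponent, and it is the magnitude of these vectors that the paper seeks to decode consistently.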
    NFANet: A Novel Method for Weakly Supervised Water Extraction from High-Resolution Remote Sensing Imagery. (arXiv:2201.03686v1 [cs.CV])
    (2 min) The use of deep learning for water extraction requires precise pixel-level labels. However, it is very difficult to label high-resolution remote sensing images at the pixel level. Therefore, we study how to utilize point labels to extract water bodies and propose a novel method called the neighbor feature aggregation network (NFANet). Compared with pixel-level labels, point labels are much easier to obtain, but they lose much information. In this paper, we take advantage of the similarity between the adjacent pixels of a local water body, and propose a neighbor sampler to resample remote sensing images. Then, the sampled images are sent to the network for feature aggregation. In addition, we use an improved recursive training algorithm to further improve the extraction accuracy, making the water boundary more natural. Furthermore, our method utilizes neighboring features instead of global or local features to learn more representative features. The experimental results show that the proposed NFANet method not only outperforms other studied weakly supervised approaches, but also obtains results similar to the state-of-the-art ones.
    An Accelerator for Rule Induction in Fuzzy Rough Theory. (arXiv:2201.03649v1 [cs.AI])
    (2 min) Rule-based classifiers, which extract a subset of induced rules so as to learn/mine efficiently while preserving the discernibility information, play a crucial role in human-explainable artificial intelligence. However, in this era of big data, rule induction on whole datasets is computationally intensive. So far, to the best of our knowledge, no method focusing on accelerating rule induction has been reported. This is the first study to consider acceleration techniques for reducing the scale of computation in rule induction. We propose an accelerator for rule induction based on fuzzy rough theory; the accelerator can avoid redundant computation and speed up the building of a rule classifier. First, a rule induction method based on consistence degree, called Consistence-based Value Reduction (CVR), is proposed and used as the basis for acceleration. Second, we introduce a compacted search space termed the Key Set, which contains only the key instances required to update the induced rules, to conduct value reduction. The monotonicity of the Key Set ensures the feasibility of our accelerator. Third, a rule-induction accelerator is designed based on the Key Set, and it is theoretically guaranteed to produce the same results as the unaccelerated version. Specifically, the rank preservation property of the Key Set ensures consistency between the rule induction achieved by the accelerator and by the unaccelerated method. Finally, extensive experiments demonstrate that the proposed accelerator can perform remarkably faster than unaccelerated rule-based classifier methods, especially on datasets with numerous instances.
    Demonstrating The Risk of Imbalanced Datasets in Chest X-ray Image-based Diagnostics by Prototypical Relevance Propagation. (arXiv:2201.03559v1 [eess.IV])
    (2 min) The recent trend of integrating multi-source Chest X-Ray datasets to improve automated diagnostics raises concerns that models learn to exploit source-specific correlations to improve performance by recognizing the source domain of an image rather than the medical pathology. We hypothesize that this effect is enforced by and leverages label-imbalance across the source domains, i.e., the prevalence of a disease corresponding to a source. Therefore, in this work, we perform a thorough study of the effect of label-imbalance in multi-source training for the task of pneumonia detection on the widely used ChestX-ray14 and CheXpert datasets. The results highlight and stress the importance of using more faithful and transparent self-explaining models for automated diagnosis, thus enabling the inherent detection of spurious learning. They further illustrate that this undesirable effect of learning spurious correlations can be reduced considerably when ensuring label-balanced source domain datasets.
    Pavlovian Signalling with General Value Functions in Agent-Agent Temporal Decision Making. (arXiv:2201.03709v1 [cs.AI])
    (2 min) In this paper, we contribute a multi-faceted study into Pavlovian signalling -- a process by which learned, temporally extended predictions made by one agent inform decision-making by another agent. Signalling is intimately connected to time and timing. In service of generating and receiving signals, humans and other animals are known to represent time, determine time since past events, predict the time until a future stimulus, and both recognize and generate patterns that unfold in time. We investigate how different temporal processes impact coordination and signalling between learning agents by introducing a partially observable decision-making domain we call the Frost Hollow. In this domain, a prediction learning agent and a reinforcement learning agent are coupled into a two-part decision-making system that works to acquire sparse reward while avoiding time-conditional hazards. We evaluate two domain variations: machine agents interacting in a seven-state linear walk, and human-machine interaction in a virtual-reality environment. Our results showcase the speed of learning for Pavlovian signalling, the impact that different temporal representations do (and do not) have on agent-agent coordination, and how temporal aliasing impacts agent-agent and human-agent interactions differently. As a main contribution, we establish Pavlovian signalling as a natural bridge between fixed signalling paradigms and fully adaptive communication learning between two agents. We further show how to computationally build this adaptive signalling process out of a fixed signalling process, characterized by fast continual prediction learning and minimal constraints on the nature of the agent receiving signals. Our results therefore suggest an actionable, constructivist path towards communication learning between reinforcement learning agents.
  • stat.ML updates on arXiv.org

    Meta-Learning Reliable Priors in the Function Space. (arXiv:2106.03195v2 [cs.LG] UPDATED)
    (2 min) When data are scarce, meta-learning can improve a learner's accuracy by harnessing previous experience from related learning tasks. However, existing methods have unreliable uncertainty estimates which are often overconfident. Addressing these shortcomings, we introduce a novel meta-learning framework, called F-PACOH, that treats meta-learned priors as stochastic processes and performs meta-level regularization directly in the function space. This allows us to directly steer the probabilistic predictions of the meta-learner towards high epistemic uncertainty in regions of insufficient meta-training data and, thus, obtain well-calibrated uncertainty estimates. Finally, we showcase how our approach can be integrated with sequential decision making, where reliable uncertainty quantification is imperative. In our benchmark study on meta-learning for Bayesian Optimization (BO), F-PACOH significantly outperforms all other meta-learners and standard baselines.
    GemNet: Universal Directional Graph Neural Networks for Molecules. (arXiv:2106.08903v6 [physics.comp-ph] UPDATED)
    (2 min) Effectively predicting molecular interactions has the potential to accelerate molecular dynamics by multiple orders of magnitude and thus revolutionize chemical simulations. Graph neural networks (GNNs) have recently shown great successes for this task, overtaking classical methods based on fixed molecular kernels. However, they still appear very limited from a theoretical perspective, since regular GNNs cannot distinguish certain types of graphs. In this work we close this gap between theory and practice. We show that GNNs with directed edge embeddings and two-hop message passing are indeed universal approximators for predictions that are invariant to translation, and equivariant to permutation and rotation. We then leverage these insights and multiple structural improvements to propose the geometric message passing neural network (GemNet). We demonstrate the benefits of the proposed changes in multiple ablation studies. GemNet outperforms previous models on the COLL, MD17, and OC20 datasets by 34%, 41%, and 20%, respectively, and performs especially well on the most challenging molecules. Our implementation is available online.
    Robust Generalised Bayesian Inference for Intractable Likelihoods. (arXiv:2104.07359v3 [stat.ME] UPDATED)
    (2 min) Generalised Bayesian inference updates prior beliefs using a loss function, rather than a likelihood, and can therefore be used to confer robustness against possible mis-specification of the likelihood. Here we consider generalised Bayesian inference with a Stein discrepancy as a loss function, motivated by applications in which the likelihood contains an intractable normalisation constant. In this context, the Stein discrepancy circumvents evaluation of the normalisation constant and produces generalised posteriors that are either closed form or accessible using standard Markov chain Monte Carlo. On a theoretical level, we show consistency, asymptotic normality, and bias-robustness of the generalised posterior, highlighting how these properties are impacted by the choice of Stein discrepancy. Then, we provide numerical experiments on a range of intractable distributions, including applications to kernel-based exponential family models and non-Gaussian graphical models.
    Bayesian imaging using Plug & Play priors: when Langevin meets Tweedie. (arXiv:2103.04715v5 [stat.ME] UPDATED)
    (3 min) Since the seminal work of Venkatakrishnan et al. (2013), Plug & Play (PnP) methods have become ubiquitous in Bayesian imaging. These methods derive Minimum Mean Square Error (MMSE) or Maximum A Posteriori (MAP) estimators for inverse problems in imaging by combining an explicit likelihood function with a prior that is implicitly defined by an image denoising algorithm. The PnP algorithms proposed in the literature mainly differ in the iterative schemes they use for optimisation or for sampling. In the case of optimisation schemes, some recent works guarantee the convergence to a fixed point, albeit not necessarily a MAP estimate. In the case of sampling schemes, to the best of our knowledge, there is no known proof of convergence. There also remain important open questions regarding whether the underlying Bayesian models and estimators are well defined, well-posed, and have the basic regularity properties required to support these numerical schemes. To address these limitations, this paper develops theory, methods, and provably convergent algorithms for performing Bayesian inference with PnP priors. We introduce two algorithms: 1) PnP-ULA (Unadjusted Langevin Algorithm) for Monte Carlo sampling and MMSE inference; and 2) PnP-SGD (Stochastic Gradient Descent) for MAP inference. Using recent results on the quantitative convergence of Markov chains, we establish detailed convergence guarantees for these two algorithms under realistic assumptions on the denoising operators used, with special attention to denoisers based on deep neural networks. We also show that these algorithms approximately target a decision-theoretically optimal Bayesian model that is well-posed. The proposed algorithms are demonstrated on several canonical problems such as image deblurring, inpainting, and denoising, where they are used for point estimation as well as for uncertainty visualisation and quantification.
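    A schematic of a single PnP-ULA iteration as described above: an unadjusted Langevin step whose prior score is supplied implicitly by the denoiser through a Tweedie-style identity. This is only a sketch; the paper's exact constants and its additional regularization term are omitted, and all parameter names are illustrative:
    ```python
    # One PnP-ULA step (schematic): Langevin dynamics with a denoiser-defined prior.
    import numpy as np

    def pnp_ula_step(x, grad_loglik, denoiser, delta=1e-4, eps=1e-2):
        prior_score = (denoiser(x) - x) / eps    # Tweedie-based prior score estimate
        noise = np.sqrt(2 * delta) * np.random.normal(size=x.shape)
        return x + delta * grad_loglik(x) + delta * prior_score + noise
    ```
    Iterating this step produces approximate posterior samples; averaging the iterates gives an MMSE estimate, and the spread of the samples supports the uncertainty visualisation and quantification mentioned in the abstract.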
    Optimal Thinning of MCMC Output. (arXiv:2005.03952v5 [stat.ME] UPDATED)
    (2 min) The use of heuristics to assess the convergence and compress the output of Markov chain Monte Carlo can be sub-optimal in terms of the empirical approximations that are produced. Typically a number of the initial states are attributed to "burn in" and removed, whilst the remainder of the chain is "thinned" if compression is also required. In this paper we consider the problem of retrospectively selecting a subset of states, of fixed cardinality, from the sample path such that the approximation provided by their empirical distribution is close to optimal. A novel method is proposed, based on greedy minimisation of a kernel Stein discrepancy, that is suitable for problems where heavy compression is required. Theoretical results guarantee consistency of the method and its effectiveness is demonstrated in the challenging context of parameter inference for ordinary differential equations. Software is available in the Stein Thinning package in Python, R and MATLAB.
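    The greedy selection rule is simple once a Stein kernel Gram matrix is in hand; a sketch assuming a precomputed matrix K0 over the sample path (its construction from a base kernel and the score function is omitted, and the function name is illustrative):
    ```python
    # Greedy kernel-Stein-discrepancy thinning: at each step, pick the state that
    # most reduces the KSD of the selected subset.
    import numpy as np

    def greedy_thin(K0: np.ndarray, m: int) -> list:
        n = K0.shape[0]
        selected = []
        running = np.zeros(n)              # running[i] = sum over selected j of K0[i, j]
        for _ in range(m):
            objective = 0.5 * np.diag(K0) + running
            i = int(np.argmin(objective))  # greedy choice at this step
            selected.append(i)             # states may be selected more than once
            running += K0[:, i]
        return selected
    ```
    Because the rule revisits the whole sample path at every step, no burn-in heuristic is needed: early, poorly located states are simply never selected.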
    Representation Ensembling for Synergistic Lifelong Learning with Quasilinear Complexity. (arXiv:2004.12908v12 [cs.AI] UPDATED)
    (3 min) In biological learning, data are used to improve performance not only on the current task, but also on previously encountered, and as yet unencountered tasks. In contrast, classical machine learning starts from a blank slate, or tabula rasa, using data only for the single task at hand. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called catastrophic forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain performance given new tasks. But striving to avoid forgetting sets the goal unnecessarily low: the goal of lifelong learning, whether biological or artificial, should be to improve performance on both past and future tasks with any new data. Our key insight is that we can ensemble representations learned independently across tasks to transfer omnidirectionally, that is, jointly improve performance on both future unforeseen tasks (forward transfer) and past tasks (backward transfer). Specifically, we propose two lifelong learners: one ensembling trees and the other ensembling networks. In both cases, the algorithms are able to learn synergistically, improving performance on both past and future tasks in a variety of simulated and real data scenarios, including tabular data, image data, spoken data, and adversarial tasks. Moreover, they can do so with quasilinear space and time complexity, a requirement for any bona fide lifelong learning system.
    Learning what to remember. (arXiv:2201.03806v1 [cs.LG])
    (2 min) We consider a lifelong learning scenario in which a learner faces a never-ending and arbitrary stream of facts and has to decide which ones to retain in its limited memory. We introduce a mathematical model based on the online learning framework, in which the learner measures itself against a collection of experts that are also memory-constrained and that reflect different policies for what to remember. Interspersed with the stream of facts are occasional questions, and on each of these the learner incurs a loss if it has not remembered the corresponding fact. Its goal is to do almost as well as the best expert in hindsight, while using roughly the same amount of memory. We identify difficulties with using the multiplicative weights update algorithm in this memory-constrained scenario, and design an alternative scheme whose regret guarantees are close to the best possible.
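    For reference, here is the standard (memory-unconstrained) multiplicative weights baseline; the abstract notes that this update runs into difficulties once the learner itself is memory-constrained. A minimal sketch with an illustrative learning rate:
    ```python
    # Multiplicative weights over K experts: downweight experts proportionally
    # to the losses they incur.
    import numpy as np

    def mwu(losses: np.ndarray, eta: float = 0.1) -> np.ndarray:
        """losses[t, k] = loss of expert k at round t; returns final weights."""
        T, K = losses.shape
        w = np.ones(K) / K
        for t in range(T):
            w *= np.exp(-eta * losses[t])   # exponential downweighting
            w /= w.sum()                    # renormalize to a distribution
        return w
    ```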
    Do We Exploit all Information for Counterfactual Analysis? Benefits of Factor Models and Idiosyncratic Correction. (arXiv:2011.03996v3 [econ.EM] UPDATED)
    (2 min) Optimal pricing, i.e., determining the price level that maximizes profit or revenue of a given product, is a vital task for the retail industry. To select such a quantity, one needs first to estimate the price elasticity from the product demand. Regression methods usually fail to recover such elasticities due to confounding effects and price endogeneity. Therefore, randomized experiments are typically required. However, elasticities can be highly heterogeneous depending on the location of stores, for example. As the randomization frequently occurs at the municipal level, standard difference-in-differences methods may also fail. Possible solutions are based on methodologies to measure the effects of treatments on a single (or just a few) treated unit(s) based on counterfactuals constructed from artificial controls. For example, for each city in the treatment group, a counterfactual may be constructed from the untreated locations. In this paper, we apply a novel high-dimensional statistical method to measure the effects of price changes on daily sales from a major retailer in Brazil. The proposed methodology combines principal components (factors) and sparse regressions, resulting in a method called the Factor-Adjusted Regularized Method for Treatment evaluation (FarmTreat). The data consist of daily sales and prices of five different products over more than 400 municipalities. The products considered belong to the "sweet and candies" category, and the experiments were conducted over the years 2016 and 2017. Our results confirm the hypothesis of a high degree of heterogeneity, yielding very different pricing strategies over distinct municipalities.
    The Ridgelet Prior: A Covariance Function Approach to Prior Specification for Bayesian Neural Networks. (arXiv:2010.08488v4 [stat.ML] UPDATED)
    (2 min) Bayesian neural networks attempt to combine the strong predictive performance of neural networks with formal quantification of uncertainty associated with the predictive output in the Bayesian framework. However, it remains unclear how to endow the parameters of the network with a prior distribution that is meaningful when lifted into the output space of the network. A possible solution is proposed that enables the user to posit an appropriate Gaussian process covariance function for the task at hand. Our approach constructs a prior distribution for the parameters of the network, called a ridgelet prior, that approximates the posited Gaussian process in the output space of the network. In contrast to existing work on the connection between neural networks and Gaussian processes, our analysis is non-asymptotic, with finite sample-size error bounds provided. This establishes the universality property that a Bayesian neural network can approximate any Gaussian process whose covariance function is sufficiently regular. Our experimental assessment is limited to a proof-of-concept, where we demonstrate that the ridgelet prior can out-perform an unstructured prior on regression problems for which a suitable Gaussian process prior can be provided.
    M2: Mixed Models with Preferences, Popularities and Transitions for Next-Basket Recommendation. (arXiv:2004.01646v3 [cs.LG] UPDATED)
    (2 min) Next-basket recommendation considers the problem of recommending a set of items into the next basket that users will purchase as a whole. In this paper, we develop a novel mixed model with preferences, popularities and transitions (M2) for the next-basket recommendation. This method models three important factors in next-basket generation process: 1) users' general preferences, 2) items' global popularities and 3) transition patterns among items. Unlike existing recurrent neural network-based approaches, M2 does not use the complicated networks to model the transitions among items, or generate embeddings for users. Instead, it has a simple encoder-decoder based approach (ed-Trans) to better model the transition patterns among items. We compared M2 with different combinations of the factors with 5 state-of-the-art next-basket recommendation methods on 4 public benchmark datasets in recommending the first, second and third next basket. Our experimental results demonstrate that M2 significantly outperforms the state-of-the-art methods on all the datasets in all the tasks, with an improvement of up to 22.1%. In addition, our ablation study demonstrates that the ed-Trans is more effective than recurrent neural networks in terms of the recommendation performance. We also have a thorough discussion on various experimental protocols and evaluation metrics for next-basket recommendation evaluation.
    A general class of surrogate functions for stable and efficient reinforcement learning. (arXiv:2108.05828v2 [cs.LG] UPDATED)
    (2 min) Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple bandit problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite.
    Competing Mutual Information Constraints with Stochastic Competition-based Activations for Learning Diversified Representations. (arXiv:2201.03624v1 [cs.LG])
    (2 min) This work aims to address the long-established problem of learning diversified representations. To this end, we combine information-theoretic arguments with stochastic competition-based activations, namely Stochastic Local Winner-Takes-All (LWTA) units. In this context, we ditch the conventional deep architectures commonly used in Representation Learning, that rely on non-linear activations; instead, we replace them with sets of locally and stochastically competing linear units. In this setting, each network layer yields sparse outputs, determined by the outcome of the competition between units that are organized into blocks of competitors. We adopt stochastic arguments for the competition mechanism, which perform posterior sampling to determine the winner of each block. We further endow the considered networks with the ability to infer the sub-part of the network that is essential for modeling the data at hand; we impose appropriate stick-breaking priors to this end. To further enrich the information of the emerging representations, we resort to information-theoretic principles, namely the Information Competing Process (ICP). Then, all the components are tied together under the stochastic Variational Bayes framework for inference. We perform a thorough experimental investigation for our approach using benchmark datasets on image classification. As we experimentally show, the resulting networks yield significant discriminative representation learning abilities. In addition, the introduced paradigm allows for a principled investigation mechanism of the emerging intermediate network representations.
    Towards Group Robustness in the presence of Partial Group Labels. (arXiv:2201.03668v1 [cs.LG])
    (2 min) Learning invariant representations is an important requirement when training machine learning models that are driven by spurious correlations in the datasets. These spurious correlations between input samples and target labels wrongly direct the neural network predictions, resulting in poor performance on certain groups, especially minority groups. Robust training against these spurious correlations requires knowledge of group membership for every sample. Such a requirement is impractical in situations where the data labeling efforts for minority or rare groups are significantly laborious or where the individuals comprising the dataset choose to conceal sensitive information. On the other hand, such data collection efforts result in datasets that contain partially labeled group information. Recent works have tackled the fully unsupervised scenario where no labels for groups are available. Thus, we aim to fill the gap in the literature by tackling a more realistic setting that can leverage partially available sensitive or group information during training. First, we construct a constraint set and derive a high-probability bound for the group assignment to belong to the set. Second, we propose an algorithm that optimizes for the worst-off group assignments from the constraint set. Through experiments on image and tabular datasets, we show improvements in the minority groups' performance while preserving overall aggregate accuracy across groups.
    Entropic Optimal Transport in Random Graphs. (arXiv:2201.03949v1 [stat.ML])
    (2 min) In graph analysis, a classic task consists in computing similarity measures between (groups of) nodes. In latent space random graphs, nodes are associated to unknown latent variables. One may then seek to compute distances directly in the latent space, using only the graph structure. In this paper, we show that it is possible to consistently estimate entropic-regularized Optimal Transport (OT) distances between groups of nodes in the latent space. We provide a general stability result for entropic OT with respect to perturbations of the cost matrix. We then apply it to several examples of random graphs, such as graphons or $\epsilon$-graphs on manifolds. Along the way, we prove new concentration results for the so-called Universal Singular Value Thresholding estimator, and for the estimation of geodesic distances on a manifold.
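    Entropic-regularized OT between two empirical distributions is typically computed with Sinkhorn iterations; a minimal numpy sketch with uniform marginals (the paper's estimation of the latent-space cost matrix from the graph is not shown):
    ```python
    # Sinkhorn iterations for entropic-regularized optimal transport.
    import numpy as np

    def sinkhorn_cost(C: np.ndarray, eps: float = 0.1, n_iter: int = 500) -> float:
        n, m = C.shape
        a, b = np.ones(n) / n, np.ones(m) / m   # uniform source/target marginals
        K = np.exp(-C / eps)                    # Gibbs kernel
        u, v = np.ones(n), np.ones(m)
        for _ in range(n_iter):                 # alternating marginal projections
            u = a / (K @ v)
            v = b / (K.T @ u)
        P = u[:, None] * K * v[None, :]         # entropic transport plan
        return float(np.sum(P * C))             # transport cost of the plan
    ```
    The paper's stability result is what makes this computation meaningful on graphs: a consistent estimate of the cost matrix C yields a consistent estimate of the entropic OT distance.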
    Partial Model Averaging in Federated Learning: Performance Guarantees and Benefits. (arXiv:2201.03789v1 [cs.LG])
    (2 min) Local Stochastic Gradient Descent (SGD) with periodic model averaging (FedAvg) is a foundational algorithm in Federated Learning. The algorithm independently runs SGD on multiple workers and periodically averages the model across all the workers. When local SGD runs with many workers, however, the periodic averaging causes a significant model discrepancy across the workers, making the global loss converge slowly. While recent advanced optimization methods tackle the issue with a focus on non-IID settings, the model discrepancy issue persists due to the underlying periodic model averaging. We propose a partial model averaging framework that mitigates the model discrepancy issue in Federated Learning. The partial averaging encourages the local models to stay close to each other in parameter space, and it enables the global loss to be minimized more effectively. Given a fixed number of iterations and a large number of workers (128), the partial averaging achieves up to 2.2% higher validation accuracy than the periodic full averaging.
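    A schematic of local SGD with periodic averaging, plus a toy "partial averaging" variant that synchronizes only a random slice of the parameters each period; this illustrates the general idea only and is not the authors' exact scheme:
    ```python
    # Local SGD with periodic (full or partial) model averaging across workers.
    import numpy as np

    def local_sgd(models, grad_fn, lr=0.01, periods=10, local_steps=5,
                  partial_frac=None):
        models = [m.copy() for m in models]       # one parameter vector per worker
        d = models[0].size
        for _ in range(periods):
            for m in models:                      # independent local SGD steps
                for _ in range(local_steps):
                    m -= lr * grad_fn(m)
            avg = np.mean(models, axis=0)
            if partial_frac is None:              # FedAvg-style full averaging
                models = [avg.copy() for _ in models]
            else:                                 # toy partial averaging variant
                idx = np.random.choice(d, int(partial_frac * d), replace=False)
                for m in models:
                    m[idx] = avg[idx]             # synchronize only a slice
        return np.mean(models, axis=0)
    ```
    The random slice here is just one way to average a subset of parameters; the paper's framework specifies which parameters to average and provides the convergence guarantees.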
    An Embedding Framework for Consistent Polyhedral Surrogates. (arXiv:1907.07330v2 [cs.LG] UPDATED)
    (2 min) We formalize and study the natural approach of designing convex surrogate loss functions via embeddings, for problems such as classification, ranking, or structured prediction. In this approach, one embeds each of the finitely many predictions (e.g., rankings) as a point in $\mathbb{R}^d$, assigns the original loss values to these points, and "convexifies" the loss in some way to obtain a surrogate. We establish a strong connection between this approach and polyhedral (piecewise-linear convex) surrogate losses. Given any polyhedral loss $L$, we give a construction of a link function through which $L$ is a consistent surrogate for the loss it embeds. Conversely, we show how to construct a consistent polyhedral surrogate for any given discrete loss. Our framework yields succinct proofs of consistency or inconsistency of various polyhedral surrogates in the literature, and for inconsistent surrogates, it further reveals the discrete losses for which these surrogates are consistent. We show some additional structure of embeddings, such as the equivalence of embedding and matching Bayes risks, and the equivalence of various notions of non-redundancy. Using these results, we establish that indirect elicitation, a necessary condition for consistency, is also sufficient when working with polyhedral surrogates.
  • Google AI Blog

    Scaling Vision with Sparse Mixture of Experts
    (8 min) Posted by Carlos Riquelme, Research Scientist and Joan Puigcerver, Software Engineer, Google Research, Brain team Advances in deep learning over the last few decades have been driven by a few key elements. With a small number of simple but flexible mechanisms (i.e., inductive biases such as convolutions or sequence attention), increasingly large datasets, and more specialized hardware, neural networks can now achieve impressive results on a wide range of tasks, such as image classification, machine translation, and protein folding prediction. However, the use of large models and datasets comes at the expense of significant computational requirements. Yet, recent works suggest that large model sizes might be necessary for strong generalization and robustness, so training large models whil…
  • The Official NVIDIA Blog

    How Retailers Meet Tough Challenges Using NVIDIA AI
    (4 min) At the National Retail Federation’s annual trade show, conversations tend to touch on recurring themes: “Will we be able to stock must-have products for next Christmas?,” “What incentives can I offer to loyal workers?” and “What happens to my margins if Susie Consumer purchases three of the same dresses online and returns two?” The $26…
    AI Startup to Take a Bite Out of Fast-Food Labor Crunch
    (3 min) Addressing a growing labor crisis among quick-service restaurants, startup Vistry is harnessing AI to automate the process of taking orders. The company will share its story at the NRF Big Show, the annual industry gathering of the National Retail Federation in New York, starting Jan. 16. “They’re closing restaurants because there is not enough labor,”…
    GFN Thursday: ‘Fortnite’ Comes to iOS Safari and Android Through NVIDIA GeForce NOW via Closed Beta
    (3 min) Starting next week, Fortnite on GeForce NOW will launch in a limited-time closed beta for mobile, all streamed through the Safari web browser on iOS and the GeForce NOW Android app. The beta is open for registration for all GeForce NOW members, and will help test our server capacity, graphics delivery and new touch controls.
  • Becoming Human: Artificial Intelligence Magazine - Medium

    Do stochastic parrots understand what they recite?
    (6 min) There had never been a time like this before in the history of Artificial Intelligence (AI) where AI was this democratic and celebrated…
    Jump start in object detection with Detectron2 (license plate detection)
    (10 min) A very practical guide to build your first object detection model, jump from rough idea to proof-of-concept in one day
  • John D. Cook

    Self-Orthogonal Latin Squares
    (3 min) The other day I wrote about orthogonal Latin squares. Two Latin squares are orthogonal if the list of pairs of corresponding elements in the two squares contains no repeats. A self-orthogonal Latin square (SOLS) is a Latin square that is orthogonal to its transpose. Here’s an example of a self-orthogonal Latin square: 1 7 6 […]
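    The definition is easy to check mechanically; a short sketch that tests whether a square is Latin and orthogonal to its own transpose, with a small order-4 example:
    ```python
    # Check self-orthogonality: all ordered pairs (A[i][j], A[j][i]) must be distinct.
    def is_self_orthogonal(A):
        n = len(A)
        symbols = set(A[0])
        # Latin square check: every row and every column uses each symbol once.
        if any(set(row) != symbols for row in A):
            return False
        if any({A[i][j] for i in range(n)} != symbols for j in range(n)):
            return False
        # Orthogonality to the transpose: superimposed pairs must not repeat.
        pairs = {(A[i][j], A[j][i]) for i in range(n) for j in range(n)}
        return len(pairs) == n * n

    sols4 = [[1, 3, 4, 2],
             [4, 2, 1, 3],
             [2, 4, 3, 1],
             [3, 1, 2, 4]]
    print(is_self_orthogonal(sols4))   # True
    ```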
    Better parts, worse system
    (2 min) There’s a rule of systems thinking that improving part of a system can often make the system as a whole worse. One example of this is Braess’ Paradox: adding roads can make traffic worse, and closing roads can improve traffic. Another example is the paradox of enrichment: increasing the food available to an ecosystem […]
  • Artificial Intelligence

    Software for generating new voices; Not voice cloning.
    (1 min) There's plenty of software out there that can clone another person's voice. And there's even some software that acts as a voice changer, mapping another person's voice to your performance. What I would like is voice-changing software that allows you to edit and save a fully custom voice. It could be used for personal animation and video game projects. There is the concern of how this might affect the careers of people in the voice acting business... But I figure that making a professional acting performance would still be a valuable service. And either way, I find it might be equally questionable to restrict the advancement of technology to preserve anyone's monopoly. submitted by /u/BladeManEXE7 [link] [comments]
    I'm investigating the impact of the implementation of AI in the rail industry. I would appreciate if you could answer a simple survey.
    (1 min) Hello everyone, I would appreciate if you could answer this survey. It is anonymous and takes about 3 minutes. LINK: https://docs.google.com/forms/d/1HxsVggCMTXtKGP3ov9lK3C6gyS9faN9Q0EtwFYqEwqY Thank you! submitted by /u/Adventurous_Wall6596 [link] [comments]
    What's the best text I could record for an AI TTS bot?
    (0 min) I want to make a realistic TTS bot of my voice, and I just need some good text to read out and record for my AI. I just don't know what I should record. Any tips? submitted by /u/Born-Macaroon-7610 [link] [comments]
    discord ai stock bot with auto-moderation and headline analysis- [buni]x, the based universal natty instrument x =]
    (0 min) submitted by /u/ungKoda [link] [comments]
    Equations for computing true positives and false positives when using object detection algorithms?
    (1 min) I am running some evaluation metrics using the YOLOv5 object detection algorithm, and wish to calculate my true positives and false positives. For instance, the evaluation metric outputs are as follows:
    ```
    Class     Images  Labels   Prec  Recall  mAP@.5  mAP@.5:.95
    all          100      36  0.444   0.702   0.481       0.223
    Class 1       50      29  0.588   0.689   0.668       0.333
    Class 2       50       7  0.301   0.714   0.293       0.113
    ```
    Looking at this source (https://github.com/ultralytics/yolov5/issues/5713), I found that you could calculate the true positives and false positives with the following equations:
    ```
    # Computed for Class 1
    TP = Recall * Labels = 34.45 ≈ 34
    FP = (TP / Precision) - TP = 23.82 ≈ 24
    ```
    I am new to evaluation metrics, so at first glance, I'm thinking that the false positive number is fairly high. Is this the correct formula to compute the true positives and false positives? I'm just looking for some verification and some explanation as to why this works, if it does. submitted by /u/EnvironmentalLemon36 [link] [comments]
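    The identities in question follow from the definitions recall = TP / labels and precision = TP / (TP + FP), which rearrange to TP = recall * labels and FP = TP / precision - TP; a quick sketch (function name illustrative):
    ```python
    # Recover approximate TP/FP counts from per-class recall, precision,
    # and the number of ground-truth labels.
    def tp_fp(recall, precision, labels):
        tp = recall * labels          # recall = TP / labels
        fp = tp / precision - tp      # precision = TP / (TP + FP)
        return round(tp), round(fp)

    print(tp_fp(recall=0.689, precision=0.588, labels=29))   # -> (20, 14)
    ```
    Note that plugging in the Class 1 row with labels = 29 gives TP ≈ 20 rather than 34.45; the 34.45 in the post appears to come from using the images count (50) in place of the labels count.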
    Top Emerging Deep Learning Trends For 2022
    (0 min) submitted by /u/techsucker [link] [comments]
    [R] Facebook AI & UC Berkeley’s ConvNeXts Compete Favourably With SOTA Hierarchical ViTs on CV Benchmarks
    (1 min) A team from Facebook AI Research and UC Berkeley proposes ConvNeXts, a pure ConvNet model that achieves performance comparable with state-of-the-art hierarchical vision transformers on computer vision benchmarks while retaining the simplicity and efficiency of standard ConvNets. Here is a quick read: Facebook AI & UC Berkeley’s ConvNeXts Compete Favourably With SOTA Hierarchical ViTs on CV Benchmarks. The ConvNeXt code is available on the project’s GitHub. The paper A ConvNet for the 2020s is on arXiv. submitted by /u/Yuqing7 [link] [comments]
    Top 10 Face Detection APIs
    (0 min) submitted by /u/tah_zem [link] [comments]
    AI did some good work for my prompt "Gothic Dream"
    (1 min) submitted by /u/snowpixelapp [link] [comments]
    The age of AI-ism
    (0 min) submitted by /u/bendee983 [link] [comments]
    Peril of Human-Like Artificial Intelligence
    (1 min) arXiv:2201.04200 (Tue, 11 Jan 2022). Title: The Turing Trap: The Promise & Peril of Human-Like Artificial Intelligence. Author: Erik Brynjolfsson. Categories: econ.GN cs.AI cs.CY cs.LG q-fin.EC. Comments: Forthcoming in Daedalus, April 2022; posted with permission. In 1950, Alan Turing proposed an imitation game as the ultimate test of whether a machine was intelligent: could a machine imitate a human so well that its answers to questions were indistinguishable from a human's? Ever since, creating intelligence that matches human intelligence has implicitly or explicitly been the goal of thousands of researchers, engineers, and entrepreneurs. The benefits of human-like artificial intelligence (HLAI) include soaring productivity, increased leisure, and perhaps most profoundly, a better understanding of our own minds. But not all types of AI are human-like. In fact, many of the most powerful systems are very different from humans. So an excessive focus on developing and deploying HLAI can lead us into a trap. As machines become better substitutes for human labor, workers lose economic and political bargaining power and become increasingly dependent on those who control the technology. In contrast, when AI is focused on augmenting humans rather than mimicking them, humans retain the power to insist on a share of the value created. Furthermore, augmentation creates new capabilities and new products and services, ultimately generating far more value than merely human-like AI. While both types of AI can be enormously beneficial, there are currently excess incentives for automation rather than augmentation among technologists, business executives, and policymakers. ( https://arxiv.org/abs/2201.04200 ) submitted by /u/kg4jxt [link] [comments]
    A curated list of Speech/NLU SaaS companies
    (1 min) Hi, I'm the maker of SpeechPro, a niche job board for Speech AI tech professionals. I've just added a page listing all the Speech/NLU SaaS companies. It's a growing list and I'll keep updating it. Right now there are 50, and most of them are ASR/TTS/NLU/CAI related. Considering that ASR and TTS are essential parts of the human-machine interface, Speech SaaS companies might gain more popularity in this #Metaverse trend. Hope this company list can be helpful. Thanks. submitted by /u/david_swagger [link] [comments]
    Solving CAPTCHAs With Machine Learning to Enable Dark Web Research
    (1 min) submitted by /u/DaveBowman1975 [link] [comments]
  • Machine Learning

    [N] If you are interested in graph representation learning with Graph Neural Networks, then read on
    (1 min) Hi r/machinelearning, For the past 1.5 years I have been organizing an online journal club on the topic of Graph Representation Learning. We meet either weekly or fortnightly via Zoom to discuss a relevant paper. We are a small and friendly group and we would like to invite others who have similar interests to join us. We meet on Thursdays, 6:00pm-7:30pm, Canada/Pacific timezone, and our next meeting is on January 20, 2022. You are welcome to join us here. Cheers! submitted by /u/YodaML [link] [comments]
    [P] Colab Notebook to create AI-Generated Pokémon in 2 clicks
    (1 min) https://colab.research.google.com/drive/1A3t2gQofQGeXo5z1BAr1zqYaqVg3czKd?usp=sharing Last month I did an experiment with AI-generated Pokémon on a fine-tuned ruDALL-E that went unexpectedly viral. Today, I've open-sourced the fine-tuned model and released a Colab Notebook that lets you generate Pokémon with just two clicks. Also, I added an infinite generation feature which works surprisingly well! submitted by /u/minimaxir [link] [comments]
    Trouble Modelling Heavy-Tailed Distributions [Research]
    (2 min) I'm currently working on a problem where I'm attempting to reconstruct 1x600 long Fourier vectors using an Autoencoder, where each element in the vector is a frequency (increasing sequentially). I have a set of 17,000 spectra to which I apply Numpy's rfft to obtain the corresponding Fourier vector. After transforming all of my data, I use an sklearn StandardScaler() to standardize each column (frequency). From here, I am training my NN on this standardized data. I have created plots of two samples from my data for illustration purposes. [two sample plots omitted] You can see that in …
    [D] Solo machine learning engineer woes
    (2 min) I work at a start-up as the sole machine learning engineer, and I have been tasked to develop a state-of-the-art model on a domain-specific problem. Some information on the task: it's a computer vision problem (complex keypoint localisation / 3D mesh prediction) with no training data; I'm proposing that I build a synthetic data pipeline supplemented with real-world data. I feel like there is too much for me to be doing at once: reading papers, writing plans and documentation, implementing code and training models, reporting to management, etc. My boss understands that research takes a long time. Nevertheless, I feel like management does not fully grasp that results on this are not certain or even likely. Even though the work environment is good, I am not handling the uncertainty around results well mentally. I wanted to ask, in your experience, what is a typical team size for this kind of job? And do you have any advice? Thanks submitted by /u/Hgat [link] [comments]
    [D] Portals for outsourcing preliminary data labeling
    (1 min) I am currently working on building a machine learning model, which will be used to automatically label tweets with regards to some prespecified labels (a total of four). Together with a team we are looking for an online tool which we could use to outsource preliminary labeling for later training of the model. So far we have been using a program built specifically for this task by one of the co-researchers, but now we want to switch to something more mainstream. The task we want to outsource goes as follows: participants will be given the text of the tweet and they will have to label it according to prespecified labels. Simple as that. Since the tweets are in Polish we are not interested in any additional features, which would be specific to the English population. We are currently considering the two following platforms: https://prodi.gy/demo https://universaldatatool.com/app/ Did any of you take part in a similar study and/or have some other experience with using the above tools? Are there any other tools that could work for the task above that you could recommend? submitted by /u/Hub_Pli [link] [comments]
    [D] What has been your experience as a machine learning manager?
    (3 min) As the title says, I am curious to learn more about what this sub’s experience has been working with/as management. As context, the company I work for (series b startup) is asking me to take over as the machine learning team manager. While I’m flattered that they see me as a “rising star”, I am a bit concerned about moving away from the actual building, testing, and deployment of machine learning models. I have been an ML engineer for about a year and prior to that I worked as a DS at large tech firm for 2 years and have managed multiple teams prior to that as well. My questions for the sub are as follows: Does moving into a management role take me out of contention for more IC roles in the broader job market? Or do recruiters not care as much about that? I recognize that I will be doing less code development as manager, but just how much of a drop off should I expect? ML Managers of Reddit, how do you keep pace with the technical developments in the field without actually writing and submitting code? Thanks to everyone in advance! submitted by /u/frank_sobotka_ [link] [comments]
    [N] EasySynth - Unreal Engine plugin for easy creation of synthetic images (depth maps, optical flow, semantics, ...)
    (2 min) Hello Community! We needed a user-friendly image dataset creation tool for Machine Learning and Computer Vision purposes, but since all we could find were advanced simulators, we decided to create an open-source one in case anybody else found it useful. EasySynth is an easy-to-use Unreal Engine plugin which enables simple generation of ground-truth depth images, normal images, optical flow images and semantic images, providing GT data for training depth estimation, normal estimation, semantic or optical flow models. EasySynth does not require knowledge of either C++ or Blueprints. It utilizes a LevelSequence (check out the video using the link below) to define the movement of the camera and provides a simple interface for semantic labeling of actors present in the scene. It supports exporting camera positions and rotations at each frame, as well as the following output formats: color images (rendered by default), depth grayscale images, pixel normal images, optical flow between frames, and images with actor semantic labels. As an example, the following output can be created in 20 minutes, from project setup to output rendering (using a third-party level). Check out the workflow timelapse. For more details check out the GitHub repo. We are working on making this plugin available for free in the Unreal Engine plugin marketplace. The current version works with UE 4.27. We hope somebody will find it useful! submitted by /u/Ok_Compote_3050 [link] [comments]
    [D] Is there value in adding comments to large ML projects?
    (1 min) So I've started working with yolov5, which has literally zero comments. If I went and added comments to the code, would it (a) be appreciated, and (b) actually be accepted into the main branch? If it would never be accepted I don't really want to spend the time, tbh. submitted by /u/a_slay_nub [link] [comments]
    [D] How to do self-research?
    (3 min) Hai! I want to gain research experience in Deep Learning (preferably NLP). How should I go about doing self-research? Working as an RA is one way, but most of the professors I've contacted are either not accepting people outside their universities or want me to work full-time, which I cannot. I want to build my research profile, which will also be a valuable asset for my grad school application. Please help me figure out ways to do research and build top-notch projects. submitted by /u/ICantStopMe- [link] [comments]
    [P] Audio data augmentation techniques
    (1 min) Audio data augmentation is a great solution to make your audio AI models more robust. In my new video in the "Audio Data Augmentation" series, you can learn about augmentation techniques both in the raw audio and spectrogram domains. Specifically, you’ll learn about: 📌 Time shifting 📌 Time stretching 📌 Pitch scaling 📌 Noise addition 📌 Impulse response addition 📌 Filters 📌 Polarity Inversion 📌 Random gain 📌 Time masking 📌 Frequency masking Check out the video: https://www.youtube.com/watch?v=bm1cQfb_pLA&list=PL-wATfeyAMNoR4aqS-Fv0GRmS6bx5RtTW&index=3 submitted by /u/diabulusInMusica [link] [comments]
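    Two of the raw-audio techniques from the list, time shifting and noise addition, are a few lines each; a numpy sketch with illustrative parameter choices:
    ```python
    # Raw-audio augmentations: circular time shift and additive Gaussian noise.
    import numpy as np

    def time_shift(x: np.ndarray, max_frac: float = 0.1) -> np.ndarray:
        k = max(1, int(max_frac * len(x)))
        return np.roll(x, np.random.randint(-k, k))   # shift waveform in time

    def add_noise(x: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
        signal_power = np.mean(x ** 2)
        noise_power = signal_power / (10 ** (snr_db / 10))   # target SNR in dB
        noise = np.random.normal(0.0, np.sqrt(noise_power), size=x.shape)
        return x + noise
    ```
    Spectrogram-domain techniques such as time and frequency masking work analogously, zeroing out random bands along the time or frequency axis.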
    [Research] Talks about Bayesian Optimization
    (1 min) [Research] Given my previous post (https://www.reddit.com/r/MachineLearning/comments/rhppgq/research_new_library_for_bayesian_optimisation/hot7x74/?context=3), I realised a lot of people are not sure what BO is. As promised, I put out a new playlist with two (short) videos about BO and its importance: Video 1: https://www.youtube.com/watch?v=YwFiB7vOQD8&t=4s Video 2: https://www.youtube.com/watch?v=85dslFi7IB8 I will add more videos over time and detail how BO actually works. The first two or three videos are there to serve as examples conveying the class of problems this field covers. If you are interested, feel free to check them out! submitted by /u/Ok_Can2425 [link] [comments]
    [D] Thoughts on IEEE Access as a journal?
    (3 min) I’m doing my PhD and am tempted to publish my latest work there due to the faster review process compared to other journals. My supervisor is encouraging me to publish there and get started on the next piece of work as soon as possible. He likes to get his PhD students finished quickly via article-based PhDs and says that Access counts as much as any of the other mainstream journals (towards achieving the goal of an article-based PhD). I want to get involved in industry research as soon as possible once graduating, and am wondering if publishing in Access will diminish my prospects vs publishing in the likes of NN or something like that. submitted by /u/DaBeastGeek [link] [comments]
    [D] Does PyTorch credit contributors?
    (1 min) Asked them on two occasions, no response. My PRs were closed and "landed" by a dev, so no standard Github contributor credit. If I were to contribute an optimizer, they'd merge it without attribution? submitted by /u/OverLordGoldDragon [link] [comments]
    [D] What is state of the art for audio generation?
    (1 min) I’m looking for suggestions for short sound effects, music, and / or speech but primarily concerned about sound effects. Is WaveGAN still dominant? submitted by /u/gameml [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Novel Viewpoint Tennis from Single Camera
    (0 min) submitted by /u/Odd_Temporary_9736 [link] [comments]
  • Reinforcement Learning

    Tips for remembering DRL algorithms for newbies
    (1 min) Hi all, I am studying the most popular DRL algorithms (NFQ, DQN, DDQN, Dueling DDQN, Reinforce, VPG, A3C, A2C, DDPG, TD3, SAC, PPO) and I was wondering a couple of things. A) What is the level of detail that you need to remember in order to use those algs efficiently? In other words, do you need to remember every bit of them? B) Do you have tricks and tips to remember them more easily, maybe by pointing me at some maps that highlight connections between them Thanks! submitted by /u/No_Possibility_7588 [link] [comments]
    Neural networks in Unity 3D
    (1 min) For a school project I need to create an agent that learns to navigate an environment using reinforcement learning (DQN, policy gradient). To do this I need to be able to create a neural network which I will then train. My question is whether there are any neural network libraries that I can use for creating and training neural networks. I know Unity has the ML-Agents toolkit but it does not seem to fit my needs. Thanks! submitted by /u/IssueRepresentative5 [link] [comments]
    what is the best approach to POMDP environment?
    (2 min) Hello, I have some questions about POMDP environments. First, I thought that in a POMDP environment a policy-based method would be better than a value-based method, for example in Alice Grid World. Is that generally correct? Second, when training a limited-view agent in a tabular environment, I expected a recurrent PPO agent to perform better than a CNN-based PPO, but it didn't. I used an existing implementation from a repository and observed slow learning with it. When I trained a StarCraft II agent, there were really huge differences between those architectures. So I just wonder what your opinions are. Many thanks! submitted by /u/Spiritual_Fig3632 [link] [comments]
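    One relevant detail for the second question: under partial observability the optimal policy generally depends on the observation history, which is why recurrent policies are expected to help; whether they do in practice depends on how much history the task actually requires. A minimal PyTorch sketch of the mechanism (all sizes illustrative):
    ```python
    # Hedged sketch: a recurrent policy whose hidden state summarises the
    # observation history, acting as a learned stand-in for the belief state.
    import torch
    import torch.nn as nn

    class RecurrentPolicy(nn.Module):
        def __init__(self, obs_dim, n_actions, hidden=64):
            super().__init__()
            self.gru = nn.GRU(obs_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_actions)

        def forward(self, obs_seq, h=None):
            # obs_seq: (batch, time, obs_dim); h carries memory across steps.
            out, h = self.gru(obs_seq, h)
            return self.head(out), h
    ```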
    issue with modifying mujoco environment
    (1 min) I am trying to modify the InvertedDoublePendulum environment in MuJoCo. I tried to change the XML parameters in the gym envs folder, but when I load the environment in my script, it still loads the same unmodified environment. I am creating a closed-loop kinematic chain by adding another cart with poles connecting to the original cart. Am I missing a step, such as using some wrapper or having to register the environment? Thanks, it's my first time doing this. submitted by /u/Lifeisunfair210 [link] [comments]
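    In classic gym, the MuJoCo envs hard-code the path to their bundled XML, so editing files inside site-packages is fragile and a cached spec can mask changes. A more robust route is to subclass the env, point it at your own XML, and register it under a new id. Hedged sketch for the pre-0.26 gym API; file paths and the id are illustrative:
    ```python
    # my_envs.py -- hedged sketch, assuming the classic gym MuJoCo API.
    import os
    from gym import utils
    from gym.envs.mujoco import mujoco_env
    from gym.envs.mujoco.inverted_double_pendulum import InvertedDoublePendulumEnv

    class ModifiedDoublePendulumEnv(InvertedDoublePendulumEnv):
        def __init__(self):
            # Bypass the parent __init__, which hard-codes the bundled XML,
            # and load the edited model instead (step/reset are inherited).
            xml_path = os.path.abspath("my_inverted_double_pendulum.xml")
            mujoco_env.MujocoEnv.__init__(self, xml_path, 5)
            utils.EzPickle.__init__(self)

    # Register once, then gym.make("ModifiedInvertedDoublePendulum-v2") works.
    from gym.envs.registration import register
    register(
        id="ModifiedInvertedDoublePendulum-v2",
        entry_point="my_envs:ModifiedDoublePendulumEnv",
        max_episode_steps=1000,
    )
    ```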
    Why do they call it 'greedy with respect to V'?
    (3 min) Hi everyone. I just started reading 'Reinforcement Learning' by Richard S. Sutton and Andrew G. Barto and I hit a point where I'm not really sure why they describe it this way (and it's not just this book, I see this expression everywhere). In chapter 3.6 they compute the optimal state value function by the Bellman optimality equation for the state value: $v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_*(s')]$. It says afterwards how to choose the best policy with the optimal state value function: if you have the optimal value function v*, then the actions that appear best after a one-step search will be optimal actions. Another way of saying this is that any policy that is greedy with respect to the optimal evaluation function v* is an optimal policy. In chapter 4.2, where they describe policy improvement, they give the following update of the policy based on v_pi: $\pi'(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_\pi(s')]$. The greedy policy takes the action that looks best in the short term, after one step of lookahead, according to v_pi ... The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy, is called policy improvement. To me, when you say "policy that is greedy with respect to v", it sounds like you simply choose the action that moves the agent to the next state with the highest state value. But this cannot be true, because you have to choose the action that maximizes $\sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_\pi(s')]$, not simply v itself. And that expression is exactly q_pi(s, a). So shouldn't the correct phrase be "greedy with respect to q" and not v? I am wondering if I'm just overthinking things at this point... submitted by /u/peterpan1998 [link] [comments]
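    The two phrasings do coincide: acting greedily with respect to v is shorthand for a one-step lookahead through the model, which is exactly maximising q_pi(s, a) computed from v. A tiny tabular sketch (P and R are assumed arrays of shape [S, A, S']):
    ```python
    # Hedged sketch: the greedy policy w.r.t. a value function v, tabular case.
    import numpy as np

    def greedy_policy(P, R, v, gamma):
        # q[s, a] = sum_{s'} P[s, a, s'] * (R[s, a, s'] + gamma * v[s'])
        q = np.einsum("sap,sap->sa", P, R + gamma * v[None, None, :])
        return np.argmax(q, axis=1)   # one action per state
    ```
    So "greedy with respect to v" means greedy over the one-step lookahead values derived from v; it only reduces to "pick the successor state with the highest v" when transitions are deterministic.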
    Did anyone try Panda-Gym?
    (1 min) The acquisition of MuJoCo led OpenAI to remove the robotics environments from their repo. I had no choice but to find an alternative. Then I found https://github.com/qgallouedec/panda-gym which is built on PyBullet. submitted by /u/chezhek [link] [comments]
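    For anyone evaluating it, getting a scene up is a few lines. A hedged usage sketch; the exact env id suffix varies across panda-gym versions:
    ```python
    # Hedged usage sketch for panda-gym with the classic gym API.
    import gym
    import panda_gym  # noqa: F401 -- importing registers the Panda tasks

    env = gym.make("PandaReach-v2")   # id suffix may differ by version
    obs = env.reset()
    obs, reward, done, info = env.step(env.action_space.sample())
    env.close()
    ```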

2022-01-12

  • cs.LG updates on arXiv.org

    AdvSim: Generating Safety-Critical Scenarios for Self-Driving Vehicles. (arXiv:2101.06549v3 [cs.RO] UPDATED)
    (2 min) As self-driving systems become better, simulating scenarios where the autonomy stack may fail becomes more important. Traditionally, those scenarios are generated for a few scenes with respect to the planning module that takes ground-truth actor states as input. This does not scale and cannot identify all possible autonomy failures, such as perception failures due to occlusion. In this paper, we propose AdvSim, an adversarial framework to generate safety-critical scenarios for any LiDAR-based autonomy system. Given an initial traffic scenario, AdvSim modifies the actors' trajectories in a physically plausible manner and updates the LiDAR sensor data to match the perturbed world. Importantly, by simulating directly from sensor data, we obtain adversarial scenarios that are safety-critical for the full autonomy stack. Our experiments show that our approach is general and can identify thousands of semantically meaningful safety-critical scenarios for a wide range of modern self-driving systems. Furthermore, we show that the robustness and safety of these systems can be further improved by training them with scenarios generated by AdvSim.
    Logically Sound Arguments for the Effectiveness of ML Safety Measures. (arXiv:2111.02649v2 [cs.LO] UPDATED)
    (0 min) We investigate the issues of achieving sufficient rigor in the arguments for the safety of machine learning functions. By considering the known weaknesses of DNN-based 2D bounding box detection algorithms, we sharpen the metric of imprecise pedestrian localization by associating it with the safety goal. The sharpening leads to introducing a conservative post-processor after the standard non-max-suppression as a counter-measure. We then propose a semi-formal assurance case for arguing the effectiveness of the post-processor, which is further translated into formal proof obligations for demonstrating the soundness of the arguments. Applying theorem proving not only discovers the need to introduce missing claims and mathematical concepts but also reveals the limitation of Dempster-Shafer's rules used in semi-formal argumentation.
    Cross-view Self-Supervised Learning on Heterogeneous Graph Neural Network via Bootstrapping. (arXiv:2201.03340v1 [cs.LG])
    (2 min) Heterogeneous graph neural networks have an excellent ability to represent the information of heterogeneous graphs. Recently, self-supervised learning, which learns a unique representation of a graph through contrastive methods, has been studied. In the absence of labels, such learning methods show great potential. However, contrastive learning relies heavily on positive and negative pairs, and generating high-quality pairs from heterogeneous graphs is difficult. In this paper, in line with recent innovations in self-supervised learning, we introduce a method that can generate good representations without generating a large number of pairs. In addition, noting that heterogeneous graphs can be viewed from two perspectives in this process, high-level representations in the graphs are captured and expressed. The proposed model showed state-of-the-art performance compared with other methods on various real-world datasets.
    A Survey of Uncertainty in Deep Neural Networks. (arXiv:2107.03342v2 [cs.LG] UPDATED)
    (0 min) Due to their increasing spread, confidence in neural network predictions has become more and more important. However, basic neural networks do not deliver certainty estimates or suffer from over- or under-confidence. Many researchers have been working on understanding and quantifying uncertainty in a neural network's prediction. As a result, different types and sources of uncertainty have been identified and a variety of approaches to measure and quantify uncertainty in neural networks have been proposed. This work gives a comprehensive overview of uncertainty estimation in neural networks, reviews recent advances in the field, highlights current challenges, and identifies potential research opportunities. It is intended to give anyone interested in uncertainty estimation in neural networks a broad overview and introduction, without presupposing prior knowledge in this field. A comprehensive introduction to the most crucial sources of uncertainty is given and their separation into reducible model uncertainty and irreducible data uncertainty is presented. The modeling of these uncertainties based on deterministic neural networks, Bayesian neural networks, ensembles of neural networks, and test-time data augmentation approaches is introduced and different branches of these fields as well as the latest developments are discussed. For a practical application, we discuss different measures of uncertainty, approaches for the calibration of neural networks, and give an overview of existing baselines and implementations. Different examples from the wide spectrum of challenges in different fields give an idea of the needs and challenges regarding uncertainties in practical applications. Additionally, the practical limitations of current methods for mission- and safety-critical real world applications are discussed and an outlook on the next steps towards a broader usage of such methods is given.
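    Of the families the survey covers, deep ensembles are arguably the simplest to try: train K independently initialised networks and read uncertainty off the spread of their predictions. A hedged sketch, where models is an assumed list of trained classifiers with a predict_proba-style interface:
    ```python
    # Hedged sketch: predictive mean and a simple disagreement score from an
    # ensemble; `models` is an assumed list of trained classifiers.
    import numpy as np

    def ensemble_predict(models, x):
        probs = np.stack([m.predict_proba(x) for m in models])  # (K, N, classes)
        mean = probs.mean(axis=0)                      # predictive distribution
        uncertainty = probs.std(axis=0).mean(axis=-1)  # spread across members
        return mean, uncertainty
    ```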
    A Tutorial on Learning With Bayesian Networks. (arXiv:2002.00269v3 [cs.LG] UPDATED)
    (0 min) A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest. When used in conjunction with statistical techniques, the graphical model has several advantages for data analysis. One, because the model encodes dependencies among all variables, it readily handles situations where some data entries are missing. Two, a Bayesian network can be used to learn causal relationships, and hence can be used to gain understanding about a problem domain and to predict the consequences of intervention. Three, because the model has both a causal and probabilistic semantics, it is an ideal representation for combining prior knowledge (which often comes in causal form) and data. Four, Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for avoiding the overfitting of data. In this paper, we discuss methods for constructing Bayesian networks from prior knowledge and summarize Bayesian statistical methods for using data to improve these models. With regard to the latter task, we describe methods for learning both the parameters and structure of a Bayesian network, including techniques for learning with incomplete data. In addition, we relate Bayesian-network methods for learning to techniques for supervised and unsupervised learning. We illustrate the graphical-modeling approach using a real-world case study.
    AIDA: An Active Inference-based Design Agent for Audio Processing Algorithms. (arXiv:2112.13366v2 [eess.AS] UPDATED)
    (0 min) In this paper we present AIDA, which is an active inference-based agent that iteratively designs a personalized audio processing algorithm through situated interactions with a human client. The target application of AIDA is to propose on-the-spot the most interesting alternative values for the tuning parameters of a hearing aid (HA) algorithm, whenever a HA client is not satisfied with their HA performance. AIDA interprets searching for the "most interesting alternative" as an issue of optimal (acoustic) context-aware Bayesian trial design. In computational terms, AIDA is realized as an active inference-based agent with an Expected Free Energy criterion for trial design. This type of architecture is inspired by neuro-economic models on efficient (Bayesian) trial design in brains and implies that AIDA comprises generative probabilistic models for acoustic signals and user responses. We propose a novel generative model for acoustic signals as a sum of time-varying auto-regressive filters and a user response model based on a Gaussian Process Classifier. The full AIDA agent has been implemented in a factor graph for the generative model and all tasks (parameter learning, acoustic context classification, trial design, etc.) are realized by variational message passing on the factor graph. All verification and validation experiments and demonstrations are freely accessible at our GitHub repository.
    Learning to Simulate Self-Driven Particles System with Coordinated Policy Optimization. (arXiv:2110.13827v2 [cs.LG] UPDATED)
    (0 min) Self-Driven Particles (SDP) describe a category of multi-agent systems common in everyday life, such as flocking birds and traffic flows. In an SDP system, each agent pursues its own goal and constantly changes its cooperative or competitive behaviors with its nearby agents. Manually designing the controllers for such SDP systems is time-consuming, while the resulting emergent behaviors are often neither realistic nor generalizable. Thus the realistic simulation of SDP systems remains challenging. Reinforcement learning provides an appealing alternative for automating the development of the controller for SDP. However, previous multi-agent reinforcement learning (MARL) methods define the agents to be teammates or enemies beforehand, which fails to capture the essence of SDP, where the role of each agent varies between cooperative and competitive even within one episode. To simulate SDP with MARL, a key challenge is to coordinate agents' behaviors while still maximizing individual objectives. Taking traffic simulation as the testing bed, in this work we develop a novel MARL method called Coordinated Policy Optimization (CoPO), which incorporates social psychology principles to learn neural controllers for SDP. Experiments show that the proposed method achieves superior performance compared to MARL baselines on various metrics. Notably, the trained vehicles exhibit complex and diverse social behaviors that improve the performance and safety of the population as a whole. Demo video and source code are available at: https://decisionforce.github.io/CoPO/
    Communication-Efficient Federated Learning via Predictive Coding. (arXiv:2108.00918v2 [cs.DC] UPDATED)
    (0 min) Federated learning can enable remote workers to collaboratively train a shared machine learning model while allowing training data to be kept locally. In the use case of wireless mobile devices, the communication overhead is a critical bottleneck due to limited power and bandwidth. Prior work has utilized various data compression tools such as quantization and sparsification to reduce the overhead. In this paper, we propose a predictive coding based compression scheme for federated learning. The scheme has shared prediction functions among all devices and allows each worker to transmit a compressed residual vector derived from the reference. In each communication round, we select the predictor and quantizer based on the rate-distortion cost, and further reduce the redundancy with entropy coding. Extensive simulations reveal that the communication cost can be reduced up to 99% with even better learning performance when compared with other baseline methods.
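    As a rough illustration of the residual idea described here (not the paper's exact scheme): each device predicts the new update from shared references, picks the predictor with the lowest distortion, and transmits only a quantised residual. The predictors and quantiser below are assumed placeholders:
    ```python
    # Hedged reconstruction of the residual-coding idea; `predictors` and
    # `quantize` are assumed placeholders shared by all devices.
    import numpy as np

    def encode(update, references, predictors, quantize):
        preds = [p(references) for p in predictors]
        # Pick the predictor with the lowest distortion (MSE here stands in
        # for the paper's rate-distortion cost).
        k = int(np.argmin([np.mean((update - pr) ** 2) for pr in preds]))
        residual = quantize(update - preds[k])
        return k, residual   # both would be entropy-coded before transmission

    def decode(k, residual, references, predictors):
        return predictors[k](references) + residual
    ```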
    L\'evy Induced Stochastic Differential Equation Equipped with Neural Network for Time Series Forecasting. (arXiv:2111.13164v2 [cs.LG] UPDATED)
    (0 min) With the fast development of modern deep learning techniques, the studies of dynamic systems and neural networks are increasingly benefiting each other in many ways. Since uncertainties often arise in real world observations, SDEs (stochastic differential equations) come to play an important role. To be more specific, in this paper we use a collection of SDEs equipped with neural networks to predict the long-term trend of noisy time series that have big jumps and strong probability-distribution shift. Our contributions are: first, we explore SDEs driven by $\alpha$-stable L\'evy motion to model the time series data and solve the problem through neural network approximation. Second, we theoretically prove the convergence of the model and obtain the convergence rate. Finally, we illustrate our method by applying it to stock market time series prediction and find the convergence order of the error.
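    The driving noise is easy to experiment with: SciPy ships the alpha-stable distribution, and an Euler-type scheme only changes in how the increments scale. A hedged toy sketch; drift, parameters, and step count are illustrative, not the paper's setup:
    ```python
    # Hedged toy: Euler-type simulation of dX = b(X) dt + dL_t with alpha-stable
    # Levy noise; increments scale as dt**(1/alpha) rather than sqrt(dt).
    import numpy as np
    from scipy.stats import levy_stable

    alpha, dt, n = 1.5, 1e-3, 1000
    x = np.zeros(n)
    for t in range(n - 1):
        drift = -x[t]                           # toy mean-reverting drift
        jump = levy_stable.rvs(alpha, 0.0)      # symmetric alpha-stable increment
        x[t + 1] = x[t] + drift * dt + dt ** (1.0 / alpha) * jump
    ```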
    Neural-PDE: A RNN based neural network for solving time dependent PDEs. (arXiv:2009.03892v3 [math.NA] UPDATED)
    (0 min) Partial differential equations (PDEs) play a crucial role in studying a vast number of problems in science and engineering. Numerically solving nonlinear and/or high-dimensional PDEs is often a challenging task. Inspired by the traditional finite difference and finite element methods and emerging advancements in machine learning, we propose a sequence deep learning framework called Neural-PDE, which can automatically learn governing rules of any time-dependent PDE system from existing data by using a bidirectional LSTM encoder, and predict the next n time steps of data. One critical feature of our proposed framework is that the Neural-PDE is able to simultaneously learn and simulate the multiscale variables. We test the Neural-PDE on a range of examples, from one-dimensional PDEs to a high-dimensional and nonlinear complex fluids model. The results show that the Neural-PDE is capable of learning the initial conditions, boundary conditions and differential operators without knowledge of the specific form of a PDE system. In our experiments the Neural-PDE can efficiently extract the dynamics within 20 epochs of training and produces accurate predictions. Furthermore, unlike traditional machine learning approaches for learning PDEs, such as CNNs and MLPs, which require vast parameters for model precision, Neural-PDE shares parameters across all time steps, thus considerably reducing computational complexity and leading to a fast learning algorithm.
    A semigroup method for high dimensional elliptic PDEs and eigenvalue problems based on neural networks. (arXiv:2105.03480v3 [math.NA] UPDATED)
    (0 min) In this paper, we propose a semigroup method for solving high-dimensional elliptic partial differential equations (PDEs) and the associated eigenvalue problems based on neural networks. For the PDE problems, we reformulate the original equations as variational problems with the help of semigroup operators and then solve the variational problems with neural network (NN) parameterization. The main advantages are that no mixed second-order derivative computation is needed during the stochastic gradient descent training and that the boundary conditions are taken into account automatically by the semigroup operator. Unlike popular methods like PINN \cite{raissi2019physics} and Deep Ritz \cite{weinan2018deep} where the Dirichlet boundary condition is enforced solely through penalty functions and thus changes the true solution, the proposed method is able to address the boundary conditions without penalty functions and it gives the correct true solution even when penalty functions are added, thanks to the semigroup operator. For eigenvalue problems, a primal-dual method is proposed, efficiently resolving the constraint with a simple scalar dual variable and resulting in a faster algorithm compared with the BSDE solver \cite{han2020solving} in certain problems such as the eigenvalue problem associated with the linear Schr\"odinger operator. Numerical results are provided to demonstrate the performance of the proposed methods.
    MARINA: Faster Non-Convex Distributed Learning with Compression. (arXiv:2102.07845v3 [cs.LG] UPDATED)
    (0 min) We develop and analyze MARINA: a new communication efficient method for non-convex distributed learning over heterogeneous datasets. MARINA employs a novel communication compression strategy based on the compression of gradient differences that is reminiscent of but different from the strategy employed in the DIANA method of Mishchenko et al. (2019). Unlike virtually all competing distributed first-order methods, including DIANA, ours is based on a carefully designed biased gradient estimator, which is the key to its superior theoretical and practical performance. The communication complexity bounds we prove for MARINA are evidently better than those of all previous first-order methods. Further, we develop and analyze two variants of MARINA: VR-MARINA and PP-MARINA. The first method is designed for the case when the local loss functions owned by clients are either of a finite sum or of an expectation form, and the second method allows for a partial participation of clients -- a feature important in federated learning. All our methods are superior to previous state-of-the-art methods in terms of oracle/communication complexity. Finally, we provide a convergence analysis of all methods for problems satisfying the Polyak-Lojasiewicz condition.
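    Reading the abstract, the per-round mechanics can be sketched as follows: usually every node sends a compressed gradient difference, and occasionally (with small probability p) a full gradient is synchronised. This is a hedged reconstruction, with compress() (e.g. rand-k sparsification) and grad_i() as assumed placeholders, not the authors' code:
    ```python
    # Hedged reconstruction of one MARINA round (not the authors' code).
    import numpy as np

    def marina_round(x, g, grads_prev, grad_i, compress, n, gamma, p):
        x_new = x - gamma * g                      # all nodes take the same step
        grads_new = [grad_i(i, x_new) for i in range(n)]
        if np.random.rand() < p:                   # rare: full synchronisation
            g_new = np.mean(grads_new, axis=0)
        else:                                      # usual: compressed differences
            diffs = [compress(grads_new[i] - grads_prev[i]) for i in range(n)]
            g_new = g + np.mean(diffs, axis=0)
        return x_new, g_new, grads_new
    ```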
    Swin Transformer for Fast MRI. (arXiv:2201.03230v1 [eess.IV])
    (0 min) Magnetic resonance imaging (MRI) is an important non-invasive clinical tool that can produce high-resolution and reproducible images. However, a long scanning time is required for high-quality MR images, which leads to exhaustion and discomfort of patients, inducing more artefacts due to voluntary movements of the patients and involuntary physiological movements. To accelerate the scanning process, methods based on k-space undersampling and deep learning based reconstruction have been popularised. This work introduces SwinMR, a novel Swin transformer based method for fast MRI reconstruction. The whole network consists of an input module (IM), a feature extraction module (FEM) and an output module (OM). The IM and OM are 2D convolutional layers, and the FEM is composed of a cascade of residual Swin transformer blocks (RSTBs) and 2D convolutional layers. The RSTB consists of a series of Swin transformer layers (STLs). The (shifted) windows multi-head self-attention (W-MSA/SW-MSA) of the STL is computed in shifted windows rather than over the whole image space as in the multi-head self-attention (MSA) of the original transformer. A novel multi-channel loss using the sensitivity maps is proposed, which is shown to preserve more textures and details. We performed a series of comparative and ablation studies on the Calgary-Campinas public brain MR dataset and conducted a downstream segmentation experiment on the Multi-modal Brain Tumour Segmentation Challenge 2017 dataset. The results demonstrate that our SwinMR achieves high-quality reconstruction compared with other benchmark methods, and it shows great robustness with different undersampling masks, under noise interruption and on different datasets. The code is publicly available at https://github.com/ayanglab/SwinMR.
    IoTGAN: GAN Powered Camouflage Against Machine Learning Based IoT Device Identification. (arXiv:2201.03281v1 [cs.CR])
    (0 min) With the proliferation of IoT devices, researchers have developed a variety of IoT device identification methods with the assistance of machine learning. Nevertheless, the security of these identification methods mostly depends on the collected training data. In this research, we propose a novel attack strategy named IoTGAN to manipulate an IoT device's traffic such that it can evade machine learning based IoT device identification. In the development of IoTGAN, we face two major technical challenges: (i) how to obtain the discriminative model in a black-box setting, and (ii) how to add perturbations to IoT traffic through the manipulative model so as to evade identification while not influencing the functionality of IoT devices. To address these challenges, a neural network based substitute model is used to fit the target model in black-box settings; it works as the discriminative model in IoTGAN. A manipulative model is trained to add adversarial perturbations to the IoT device's traffic to evade the substitute model. Experimental results show that IoTGAN can successfully achieve the attack goals. We also develop efficient countermeasures to protect machine learning based IoT device identification from being undermined by IoTGAN.
    Surrogate-assisted performance prediction for data-driven knowledge discovery algorithms: application to evolutionary modeling of clinical pathways. (arXiv:2004.01123v2 [cs.LG] UPDATED)
    (0 min) The paper proposes and investigates an approach for surrogate-assisted performance prediction of data-driven knowledge discovery algorithms. The approach is based on the identification of surrogate models for prediction of the target algorithm's quality and performance. The proposed approach was implemented and investigated as applied to an evolutionary algorithm for discovering clusters of interpretable clinical pathways in electronic health records of patients with acute coronary syndrome. Several clustering metrics and execution time were used as the target quality and performance metrics respectively. An analytical software prototype based on the proposed approach for the prediction of algorithm characteristics and feature analysis was developed to provide a more interpretable prediction of the target algorithm's performance and quality that can be further used for parameter tuning.
    Automated Lay Language Summarization of Biomedical Scientific Reviews. (arXiv:2012.12573v3 [cs.CL] UPDATED)
    (0 min) Health literacy has emerged as a crucial factor in making appropriate health decisions and ensuring treatment outcomes. However, medical jargon and the complex structure of professional language in this domain make health information especially hard to interpret. Thus, there is an urgent unmet need for automated methods to enhance the accessibility of the biomedical literature to the general population. This problem can be framed as a type of translation problem between the language of healthcare professionals, and that of the general public. In this paper, we introduce the novel task of automated generation of lay language summaries of biomedical scientific reviews, and construct a dataset to support the development and evaluation of automated methods through which to enhance the accessibility of the biomedical literature. We conduct analyses of the various challenges in solving this task, including not only summarization of the key points but also explanation of background knowledge and simplification of professional language. We experiment with state-of-the-art summarization models as well as several data augmentation techniques, and evaluate their performance using both automated metrics and human assessment. Results indicate that automatically generated summaries produced using contemporary neural architectures can achieve promising quality and readability as compared with reference summaries developed for the lay public by experts (best ROUGE-L of 50.24 and Flesch-Kincaid readability score of 13.30). We also discuss the limitations of the current attempt, providing insights and directions for future work.
    Trade-off on Sim2Real Learning: Real-world Learning Faster than Simulations. (arXiv:2007.10675v4 [cs.RO] UPDATED)
    (0 min) Deep Reinforcement Learning (DRL) experiments are commonly performed in simulated environments due to the tremendous training sample demands of deep neural networks. In contrast, model-based Bayesian Learning allows a robot to learn good policies within a few trials in the real world. Although it takes fewer iterations, Bayesian methods pay a relatively higher computational cost per trial, and the advantage of such methods is strongly tied to dimensionality and noise. Here, we compare a Deep Bayesian Learning algorithm with a model-free DRL algorithm while analyzing our results collected from both simulations and real-world experiments. Considering Sim and Real learning, our experiments show that the sample-efficient Deep Bayesian RL performs better than DRL even when computation time (as opposed to number of iterations) is taken into consideration. Additionally, the difference in computation time between Deep Bayesian RL performed in simulation and in experiments points to a viable path to traverse the reality gap. We also show that a mix between Sim and Real does not outperform a purely Real approach, pointing to the possibility that reality can provide the best prior knowledge for Bayesian Learning. Roboticists design and build robots every day, and our results show that higher learning efficiency in the real world will shorten the time between design and deployment by skipping simulations.
    Toward a Fairness-Aware Scoring System for Algorithmic Decision-Making. (arXiv:2109.10053v3 [cs.LG] UPDATED)
    (0 min) Scoring systems, as a type of predictive model, have significant advantages in interpretability and transparency and facilitate quick decision-making. As such, scoring systems have been extensively used in a wide variety of industries such as healthcare and criminal justice. However, the fairness issues in these models have long been criticized, and the use of big data and machine learning algorithms in the construction of scoring systems heightens this concern. In this paper, we propose a general framework to create fairness-aware, data-driven scoring systems. First, we develop a social welfare function that incorporates both efficiency and group fairness. Then, we transform the social welfare maximization problem into the risk minimization task in machine learning, and derive a fairness-aware scoring system with the help of mixed integer programming. Lastly, several theoretical bounds are derived for providing parameter selection suggestions. Our proposed framework provides a suitable solution to address group fairness concerns in the development of scoring systems. It enables policymakers to set and customize their desired fairness requirements as well as other application-specific constraints. We test the proposed algorithm with several empirical data sets. Experimental evidence supports the effectiveness of the proposed scoring system in achieving the optimal welfare of stakeholders and in balancing the needs for interpretability, fairness, and efficiency.
    Bridging the gap to real-world for network intrusion detection systems with data-centric approach. (arXiv:2110.13655v2 [cs.CR] UPDATED)
    (0 min) Most research using machine learning (ML) for network intrusion detection systems (NIDS) uses well-established datasets such as KDD-CUP99, NSL-KDD, UNSW-NB15, and CICIDS-2017. In this context, the possibilities of machine learning techniques are explored, aiming for metric improvements compared to the published baselines (model-centric approach). However, those datasets present some limitations, such as aging, that make it unfeasible to transpose those ML-based solutions to real-world applications. This paper presents a systematic data-centric approach to address the current limitations of NIDS research, specifically the datasets. This approach generates NIDS datasets composed of the most recent network traffic and attacks, with the labeling process integrated by design.
    GBRS: An Unified Model of Pawlak Rough Set and Neighborhood Rough Set. (arXiv:2201.03349v1 [cs.AI])
    (0 min) Pawlak rough sets and neighborhood rough sets are the two most common rough set theoretical models. Pawlak rough sets can use equivalence classes to represent knowledge, but they cannot process continuous data; neighborhood rough sets can process continuous data, but they lose the ability to use equivalence classes to represent knowledge. To this end, this paper presents a granular-ball rough set based on granular-ball computing. The granular-ball rough set can simultaneously represent the Pawlak rough set and the neighborhood rough set, so as to realize a unified representation of the two. This makes the granular-ball rough set able not only to deal with continuous data, but also to use equivalence classes for knowledge representation. In addition, we propose an implementation algorithm for granular-ball rough sets. The experimental results on benchmark datasets demonstrate that, due to the combination of the robustness and adaptability of granular-ball computing, the learning accuracy of the granular-ball rough set is greatly improved compared with the Pawlak rough set and the traditional neighborhood rough set. The granular-ball rough set also outperforms nine popular or state-of-the-art feature selection methods.
    DiffSDFSim: Differentiable Rigid-Body Dynamics With Implicit Shapes. (arXiv:2111.15318v2 [cs.CV] UPDATED)
    (0 min) Differentiable physics is a powerful tool in computer vision and robotics for scene understanding and reasoning about interactions. Existing approaches have frequently been limited to objects with simple shape or shapes that are known in advance. In this paper, we propose a novel approach to differentiable physics with frictional contacts which represents object shapes implicitly using signed distance fields (SDFs). Our simulation supports contact point calculation even when the involved shapes are nonconvex. Moreover, we propose ways for differentiating the dynamics for the object shape to facilitate shape optimization using gradient-based methods. In our experiments, we demonstrate that our approach allows for model-based inference of physical parameters such as friction coefficients, mass, forces or shape parameters from trajectory and depth image observations in several challenging synthetic scenarios and a real image sequence.
    Optimal radial basis for density-based atomic representations. (arXiv:2105.08717v2 [stat.ML] UPDATED)
    (0 min) The input of almost every machine learning algorithm targeting the properties of matter at the atomic scale involves a transformation of the list of Cartesian atomic coordinates into a more symmetric representation. Many of the most popular representations can be seen as an expansion of the symmetrized correlations of the atom density, and differ mainly by the choice of basis. Considerable effort has been dedicated to the optimization of the basis set, typically driven by heuristic considerations on the behavior of the regression target. Here we take a different, unsupervised viewpoint, aiming to determine the basis that encodes in the most compact way possible the structural information that is relevant for the dataset at hand. For each training dataset and number of basis functions, one can determine a unique basis that is optimal in this sense, and can be computed at no additional cost with respect to the primitive basis by approximating it with splines. We demonstrate that this construction yields representations that are accurate and computationally efficient, particularly when constructing representations that correspond to high-body order correlations. We present examples that involve both molecular and condensed-phase machine-learning models.
    Attention is All You Need? Good Embeddings with Statistics are enough: Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or ..... (arXiv:2110.03183v3 [cs.SD] UPDATED)
    (0 min) This paper presents a way of doing large scale audio understanding without traditional state of the art neural architectures. Ever since the introduction of deep learning for understanding audio signals in the past decade, convolutional architectures have been able to achieve state of the art results surpassing traditional hand-crafted features. In the recent past, there has been a similar shift away from traditional convolutional and recurrent neural networks towards purely end-to-end Transformer architectures. We, in this work, explore an approach based on a Bag-of-Words model. Our approach does not have any convolutions, recurrence, attention, transformers or other approaches such as BERT. We utilize micro and macro level clustered vanilla embeddings, and use an MLP head for classification. We only use feed-forward encoder-decoder models to get the bottlenecks of spectral envelops, spectral patches and slices as well as multi-resolution spectra. A classification head (a feed-forward layer), similar to the approach in SimCLR, is trained on a learned representation. Using simple codes learned on latent representations, we show how we surpass traditional convolutional neural network architectures, and come strikingly close to outperforming powerful Transformer architectures. This work hopefully paves the way for exciting advancements in the field of representation learning without massive, end-to-end neural architectures.
    Unsupervised Image Decomposition with Phase-Correlation Networks. (arXiv:2110.03473v3 [cs.CV] UPDATED)
    (0 min) The ability to decompose scenes into their object components is a desired property for autonomous agents, allowing them to reason and act in their surroundings. Recently, different methods have been proposed to learn object-centric representations from data in an unsupervised manner. These methods often rely on latent representations learned by deep neural networks, hence requiring high computational costs and large amounts of curated data. Such models are also difficult to interpret. To address these challenges, we propose the Phase-Correlation Decomposition Network (PCDNet), a novel model that decomposes a scene into its object components, which are represented as transformed versions of a set of learned object prototypes. The core building block in PCDNet is the Phase-Correlation Cell (PC Cell), which exploits the frequency-domain representation of the images in order to estimate the transformation between an object prototype and its transformed version in the image. In our experiments, we show how PCDNet outperforms state-of-the-art methods for unsupervised object discovery and segmentation on simple benchmark datasets and on more challenging data, while using a small number of learnable parameters and being fully interpretable.
    Continual Learning via Bit-Level Information Preserving. (arXiv:2105.04444v2 [cs.LG] UPDATED)
    (0 min) Continual learning tackles the setting of learning different tasks sequentially. Despite a lot of previous solutions, most of them still suffer from significant forgetting or expensive memory costs. In this work, targeting these problems, we first study the continual learning process through the lens of information theory and observe that forgetting of a model stems from the loss of \emph{information gain} on its parameters from the previous tasks when learning a new task. From this viewpoint, we then propose a novel continual learning approach called Bit-Level Information Preserving (BLIP) that preserves the information gain on model parameters through updating the parameters at the bit level, which can be conveniently implemented with parameter quantization. More specifically, BLIP first trains a neural network with weight quantization on the new incoming task and then estimates the information gain on each parameter provided by the task data to determine the bits to be frozen to prevent forgetting. We conduct extensive experiments ranging from classification tasks to reinforcement learning tasks, and the results show that our method produces better or on-par results compared to previous state-of-the-art methods. Indeed, BLIP achieves close to zero forgetting while only requiring constant memory overheads throughout continual learning.
    Retiring Adult: New Datasets for Fair Machine Learning. (arXiv:2108.04884v3 [cs.LG] UPDATED)
    (0 min) Although the fairness community has recognized the importance of data, researchers in the area primarily rely on UCI Adult when it comes to tabular data. Derived from a 1994 US Census survey, this dataset has appeared in hundreds of research papers where it served as the basis for the development and comparison of many algorithmic fairness interventions. We reconstruct a superset of the UCI Adult data from available US Census sources and reveal idiosyncrasies of the UCI Adult dataset that limit its external validity. Our primary contribution is a suite of new datasets derived from US Census surveys that extend the existing data ecosystem for research on fair machine learning. We create prediction tasks relating to income, employment, health, transportation, and housing. The data span multiple years and all states of the United States, allowing researchers to study temporal shift and geographic variation. We highlight a broad initial sweep of new empirical insights relating to trade-offs between fairness criteria, performance of algorithmic interventions, and the role of distribution shift based on our new datasets. Our findings inform ongoing debates, challenge some existing narratives, and point to future research directions. Our datasets are available at https://github.com/zykls/folktables.
    Sustainable AI: Environmental Implications, Challenges and Opportunities. (arXiv:2111.00364v2 [cs.LG] UPDATED)
    (0 min) This paper explores the environmental impact of the super-linear growth trends for AI from a holistic perspective, spanning Data, Algorithms, and System Hardware. We characterize the carbon footprint of AI computing by examining the model development cycle across industry-scale machine learning use cases and, at the same time, considering the life cycle of system hardware. Taking a step further, we capture the operational and manufacturing carbon footprint of AI computing and present an end-to-end analysis for what and how hardware-software design and at-scale optimization can help reduce the overall carbon footprint of AI. Based on the industry experience and lessons learned, we share the key challenges and chart out important development directions across the many dimensions of AI. We hope the key messages and insights presented in this paper can inspire the community to advance the field of AI in an environmentally-responsible manner.
    Grounded Relational Inference: Domain Knowledge Driven Explainable Autonomous Driving. (arXiv:2102.11905v2 [cs.AI] UPDATED)
    (0 min) Explainability is essential for autonomous vehicles and other robotics systems interacting with humans and other objects during operation. Humans need to understand and anticipate the actions taken by the machines for trustful and safe cooperation. In this work, we aim to develop an explainable model that generates explanations consistent with both human domain knowledge and the model's inherent causal relation. In particular, we focus on an essential building block of autonomous driving, multi-agent interaction modeling. We propose Grounded Relational Inference (GRI). It models an interactive system's underlying dynamics by inferring an interaction graph representing the agents' relations. We ensure a semantically meaningful interaction graph by grounding the relational latent space into semantic interactive behaviors defined with expert domain knowledge. We demonstrate that it can model interactive traffic scenarios under both simulation and real-world settings, and generate semantic graphs explaining the vehicle's behavior by their interactions.
    A Deep Learning-Accelerated Data Assimilation and Forecasting Workflow for Commercial-Scale Geologic Carbon Storage. (arXiv:2105.09468v2 [physics.geo-ph] UPDATED)
    (0 min) Fast assimilation of monitoring data to update forecasts of pressure buildup and carbon dioxide (CO2) plume migration under geologic uncertainties is a challenging problem in geologic carbon storage. The high computational cost of data assimilation with a high-dimensional parameter space impedes fast decision-making for commercial-scale reservoir management. We propose to leverage physical understandings of porous medium flow behavior with deep learning techniques to develop a fast history matching-reservoir response forecasting workflow. Applying an Ensemble Smoother Multiple Data Assimilation framework, the workflow updates geologic properties and predicts reservoir performance with quantified uncertainty from pressure history and CO2 plumes interpreted through seismic inversion. As the most computationally expensive component in such a workflow is reservoir simulation, we developed surrogate models to predict dynamic pressure and CO2 plume extents under multi-well injection. The surrogate models employ deep convolutional neural networks, specifically, a wide residual network and a residual U-Net. The workflow is validated against a flat three-dimensional reservoir model representative of a clastic shelf depositional environment. Intelligent treatments are applied to bridge between quantities in a true-3D reservoir model and those in a single-layer reservoir model. The workflow can complete history matching and reservoir forecasting with uncertainty quantification in less than one hour on a mainstream personal workstation.
    Comparison of Representation Learning Techniques for Tracking in time resolved 3D Ultrasound. (arXiv:2201.03319v1 [eess.IV])
    (0 min) 3D ultrasound (3DUS) is becoming more interesting for target tracking in radiation therapy due to its capability to provide volumetric images in real-time without using ionizing radiation. It is potentially usable for tracking without fiducials. For this, a method for learning meaningful representations would be useful to recognize anatomical structures in different time frames in representation space (r-space). In this study, 3DUS patches are reduced into a 128-dimensional r-space using a conventional autoencoder, a variational autoencoder and a sliced-Wasserstein autoencoder. In the r-space, the capability of separating different ultrasound patches as well as recognizing similar patches is investigated and compared based on a dataset of liver images. Two metrics to evaluate the tracking capability in the r-space are proposed. It is shown that ultrasound patches with different anatomical structures can be distinguished and sets of similar patches can be clustered in r-space. The results indicate that the investigated autoencoders have different levels of usability for target tracking in 3DUS.
    Individual Privacy Accounting via a Renyi Filter. (arXiv:2008.11193v4 [cs.CR] UPDATED)
    (0 min) We consider a sequential setting in which a single dataset of individuals is used to perform adaptively-chosen analyses, while ensuring that the differential privacy loss of each participant does not exceed a pre-specified privacy budget. The standard approach to this problem relies on bounding a worst-case estimate of the privacy loss over all individuals and all possible values of their data, for every single analysis. Yet, in many scenarios this approach is overly conservative, especially for "typical" data points which incur little privacy loss by participation in most of the analyses. In this work, we give a method for tighter privacy loss accounting based on the value of a personalized privacy loss estimate for each individual in each analysis. To implement the accounting method we design a filter for R\'enyi differential privacy. A filter is a tool that ensures that the privacy parameter of a composed sequence of algorithms with adaptively-chosen privacy parameters does not exceed a pre-specified budget. Our filter is simpler and tighter than the known filter for $(\epsilon,\delta)$-differential privacy by Rogers et al. We apply our results to the analysis of noisy gradient descent and show that personalized accounting can be practical, easy to implement, and can only make the privacy-utility tradeoff tighter.
    Gait Recognition Based on Deep Learning: A Survey. (arXiv:2201.03323v1 [cs.CV])
    (0 min) In general, biometry-based control systems may not rely on individuals' expected behavior or cooperation to operate appropriately. Instead, such systems should be aware of malicious procedures for unauthorized access attempts. Some works available in the literature suggest addressing the problem through gait recognition approaches. Such methods aim at identifying human beings through intrinsic perceptible features, regardless of clothes or accessories. Although the issue has been a relatively long-standing challenge, most of the techniques developed to handle the problem present several drawbacks related to feature extraction and low classification rates, among other issues. However, deep learning-based approaches recently emerged as a robust set of tools to deal with virtually any image and computer-vision-related problem, providing paramount results for gait recognition as well. Therefore, this work provides a surveyed compilation of recent works on biometric detection through gait recognition with a focus on deep learning approaches, emphasizing their benefits and exposing their weaknesses. Besides, it also presents categorized and characterized descriptions of the datasets, approaches, and architectures employed to tackle associated constraints.
    Studying Hadronization by Machine Learning Techniques. (arXiv:2111.15655v2 [hep-ph] UPDATED)
    (0 min) Hadronization is a non-perturbative process whose theoretical description cannot be deduced from first principles. Modeling hadron formation requires several assumptions and various phenomenological approaches. Utilizing state-of-the-art Computer Vision and Deep Learning algorithms, it becomes possible to train neural networks to learn non-linear and non-perturbative features of the physical processes. In this study, results of two ResNet networks are presented, investigating global and kinematical quantities, namely jet- and event-shape variables. The widely used Lund string fragmentation model is applied as a baseline in $\sqrt{s}= 7$ TeV proton-proton collisions to predict the most relevant observables at further LHC energies.
    Continual Learning of Long Topic Sequences in Neural Information Retrieval. (arXiv:2201.03356v1 [cs.IR])
    (0 min) In information retrieval (IR) systems, trends and users' interests may change over time, altering either the distribution of requests or contents to be recommended. Since neural ranking approaches heavily depend on the training data, it is crucial to understand the transfer capacity of recent IR approaches to address new domains in the long term. In this paper, we first propose a dataset based upon the MSMarco corpus aiming at modeling a long stream of topics as well as IR property-driven controlled settings. We then in-depth analyze the ability of recent neural IR models while continually learning those streams. Our empirical study highlights in which particular cases catastrophic forgetting occurs (e.g., level of similarity between tasks, peculiarities on text length, and ways of learning models) to provide future directions in terms of model design.
    SUPER-ADAM: Faster and Universal Framework of Adaptive Gradients. (arXiv:2106.08208v8 [math.OC] UPDATED)
    (0 min) Adaptive gradient methods have shown excellent performances for solving many machine learning problems. Although multiple adaptive gradient methods were recently studied, they mainly focus on either empirical or theoretical aspects and also only work for specific problems by using some specific adaptive learning rates. Thus, it is desired to design a universal framework for practical algorithms of adaptive gradients with theoretical guarantee to solve general problems. To fill this gap, we propose a faster and universal framework of adaptive gradients (i.e., SUPER-ADAM) by introducing a universal adaptive matrix that includes most existing adaptive gradient forms. Moreover, our framework can flexibly integrate the momentum and variance reduced techniques. In particular, our novel framework provides the convergence analysis support for adaptive gradient methods under the nonconvex setting. In theoretical analysis, we prove that our SUPER-ADAM algorithm can achieve the best known gradient (i.e., stochastic first-order oracle (SFO)) complexity of $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point of nonconvex optimization, which matches the lower bound for stochastic smooth nonconvex optimization. In numerical experiments, we employ various deep learning tasks to validate that our algorithm consistently outperforms the existing adaptive algorithms. Code is available at https://github.com/LIJUNYI95/SuperAdam
    GraphFM: Graph Factorization Machines for Feature Interaction Modeling. (arXiv:2105.11866v3 [cs.LG] UPDATED)
    (0 min) Factorization machine (FM) is a prevalent approach to modeling pairwise (second-order) feature interactions when dealing with high-dimensional sparse data. However, on the one hand, FM fails to capture higher-order feature interactions, suffering from combinatorial expansion; on the other hand, taking into account interactions between every pair of features may introduce noise and degrade prediction accuracy. To solve these problems, we propose a novel approach, Graph Factorization Machine (GraphFM), which naturally represents features in a graph structure. In particular, a novel mechanism is designed to select the beneficial feature interactions and formulate them as edges between features. Then our proposed model, which integrates the interaction function of FM into the feature aggregation strategy of Graph Neural Networks (GNN), can model arbitrary-order feature interactions on the graph-structured features by stacking layers. Experimental results on several real-world datasets have demonstrated the rationality and effectiveness of our proposed approach.
    GDI: Rethinking What Makes Reinforcement Learning Different From Supervised Learning. (arXiv:2106.06232v6 [cs.LG] UPDATED)
    (0 min) Deep Q Network (DQN) first opened the door to deep reinforcement learning (DRL) by combining deep learning (DL) with reinforcement learning (RL), and noticed that the distribution of the acquired data changes during the training process. DQN found that this property might cause instability in training, so it proposed effective methods to handle the downside of the property. Instead of focusing on the unfavourable aspects, we find it critical for RL to ease the gap between the estimated data distribution and the ground-truth data distribution, something supervised learning (SL) fails to do. From this new perspective, we extend the basic paradigm of RL, called Generalized Policy Iteration (GPI), into a more generalized version called Generalized Data Distribution Iteration (GDI). We show that massive RL algorithms and techniques can be unified under the GDI paradigm, with GPI being one of its special cases. We provide theoretical proof of why GDI is better than GPI and how it works. Several practical algorithms based on GDI have been proposed to verify its effectiveness and extensiveness. Empirical experiments demonstrate our state-of-the-art (SOTA) performance on the Arcade Learning Environment (ALE), wherein our algorithm has achieved 9620.98% mean human normalized score (HNS), 1146.39% median HNS and 22 human world record breakthroughs (HWRB) using only 200M training frames. Our work aims to lead RL research into the journey of conquering human world records and seeking real superhuman agents in both performance and efficiency.
    Refinement of Hottopixx Method for Nonnegative Matrix Factorization Under Noisy Separability. (arXiv:2109.02863v2 [cs.LG] UPDATED)
    (0 min) Hottopixx, proposed by Bittorf et al. at NIPS 2012, is an algorithm for solving nonnegative matrix factorization (NMF) problems under the separability assumption. Separable NMFs have important applications, such as topic extraction from documents and unmixing of hyperspectral images. In such applications, the robustness of the algorithm to noise is the key to the success. Hottopixx has been shown to be robust to noise, and its robustness can be further enhanced through postprocessing. However, there is a drawback. Hottopixx and its postprocessing require us to estimate the noise level involved in the matrix we want to factorize before running, since they use it as part of the input data. The noise-level estimation is not an easy task. In this paper, we overcome this drawback. We present a refinement of Hottopixx and its postprocessing that runs without prior knowledge of the noise level. We show that the refinement has almost the same robustness to noise as the original algorithm.
    GTAdam: Gradient Tracking with Adaptive Momentum for Distributed Online Optimization. (arXiv:2009.01745v2 [math.OC] UPDATED)
    (0 min) This paper deals with a network of computing agents aiming to solve an online optimization problem in a distributed fashion, i.e., by means of local computation and communication, without any central coordinator. We propose the gradient tracking with adaptive momentum estimation (GTAdam) distributed algorithm, which combines a gradient tracking mechanism with first and second order momentum estimates of the gradient. The algorithm is analyzed in the online setting for strongly convex cost functions with Lipschitz continuous gradients. We provide an upper bound for the dynamic regret given by a term related to the initial conditions, and another term related to the temporal variations of the objective functions. Moreover, a linear convergence rate is guaranteed in the static set-up. The algorithm is tested on a time-varying classification problem, on a (moving) target localization problem and in a stochastic optimization setup from image classification. In these numerical experiments from multi-agent learning, GTAdam outperforms state-of-the-art distributed optimization methods.
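    For context, the gradient-tracking half of GTAdam can be sketched in a few lines: each agent mixes its iterate with its neighbours' (via a doubly stochastic matrix W) and maintains a tracker y of the network-average gradient. The Adam-style first- and second-order momentum estimates that GTAdam adds on top are omitted; everything below is an illustrative assumption, not the authors' implementation:
    ```python
    # Hedged sketch of one synchronous gradient-tracking round (momentum omitted).
    import numpy as np

    def gradient_tracking_step(X, Y, G_prev, grad_fn, W, alpha):
        # X: (n_agents, dim) iterates; Y: tracked average-gradient estimates.
        X_new = W @ X - alpha * Y                       # consensus + descent
        G_new = np.stack([grad_fn(i, X_new[i]) for i in range(X.shape[0])])
        Y_new = W @ Y + G_new - G_prev                  # track the avg gradient
        return X_new, Y_new, G_new
    ```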
    Stability Analysis of Unfolded WMMSE for Power Allocation. (arXiv:2110.07471v2 [eess.SP] UPDATED)
    (0 min) Power allocation is one of the fundamental problems in wireless networks, and a wide variety of algorithms address this problem from different perspectives. A common element among these algorithms is that they rely on an estimation of the channel state, which may be inaccurate on account of hardware defects, noisy feedback systems, and environmental and adversarial disturbances. Therefore, it is essential that the output power allocation of these algorithms is stable with respect to input perturbations, in the sense that the variations in the output are bounded for bounded variations in the input. In this paper, we focus on the unfolded WMMSE (UWMMSE) algorithm, a modern approach leveraging graph neural networks, and illustrate its stability to additive input perturbations of bounded energy through both theoretical analysis and empirical validation.
    Understanding occupants' behaviour, engagement, emotion, and comfort indoors with heterogeneous sensors and wearables. (arXiv:2105.06637v2 [cs.HC] UPDATED)
    (0 min) We conducted a field study at a K-12 private school in the suburbs of Melbourne, Australia. The data capture contained two elements: First, a 5-month longitudinal field study In-Gauge using two outdoor weather stations, as well as indoor weather stations in 17 classrooms and temperature sensors on the vents of occupant-controlled room air-conditioners; these were collated into individual datasets for each classroom at a 5-minute logging frequency, including additional data on occupant presence. The dataset was used to derive predictive models of how occupants operate room air-conditioning units. Second, we tracked 23 students and 6 teachers in a 4-week cross-sectional study En-Gage, using wearable sensors to log physiological data, as well as daily surveys to query the occupants' thermal comfort, learning engagement, emotions and seating behaviours. Overall, the combined dataset could be used to analyse the relationships between indoor/outdoor climates and students' behaviours/mental states on campus, which provide opportunities for the future design of intelligent feedback systems to benefit both students and staff.
    Federated learning and next generation wireless communications: A survey on bidirectional relationship. (arXiv:2110.07649v2 [eess.SP] UPDATED)
    (0 min) In order to meet the extremely heterogeneous requirements of next generation wireless communication networks, the research community increasingly depends on machine learning solutions for real-time decision-making and radio resource management. Traditional machine learning employs a fully centralized architecture in which the entire training data is collected at one node, e.g., a cloud server, which significantly increases communication overheads and also raises severe privacy concerns. To this end, a distributed machine learning paradigm termed federated learning (FL) has recently been proposed. In FL, each participating edge device trains its local model using its own training data. The weights or parameters of the locally trained models are then sent over the wireless channels to the central parameter server (PS), which aggregates them and updates the global model. On the one hand, FL plays an important role in optimizing the resources of wireless communication networks; on the other hand, wireless communications are crucial for FL. Thus, a `bidirectional' relationship exists between FL and wireless communications. Although FL is an emerging concept, many publications have already appeared in the domain of FL and its applications to next generation wireless networks. Nevertheless, we noticed that none of these works has highlighted the bidirectional relationship between FL and wireless communications. The purpose of this survey paper is therefore to bridge this gap in the literature by providing a timely and comprehensive discussion on the interdependency between FL and wireless communications.
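    As a concrete illustration of the FL round described above, here is a minimal FedAvg-style sketch: each device takes a local step from the current global model, and the parameter server aggregates the resulting weights, weighted by local dataset size. The per-device "data" is a made-up quadratic target.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_devices, d, rounds, lr = 5, 3, 50, 0.1
    targets = rng.standard_normal((n_devices, d))   # stand-in for each device's local data
    sizes = rng.integers(50, 200, size=n_devices)   # local dataset sizes

    global_w = np.zeros(d)
    for _ in range(rounds):
        local_ws = []
        for i in range(n_devices):
            w = global_w.copy()
            w -= lr * (w - targets[i])              # one local SGD step on a quadratic loss
            local_ws.append(w)
        # Parameter server aggregates weights, weighted by local dataset size (FedAvg)
        global_w = np.average(local_ws, axis=0, weights=sizes)

    print("global model:", global_w)
    ```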
    Wiki-CS: A Wikipedia-Based Benchmark for Graph Neural Networks. (arXiv:2007.02901v2 [cs.LG] UPDATED)
    (0 min) We present Wiki-CS, a novel dataset derived from Wikipedia for benchmarking Graph Neural Networks. The dataset consists of nodes corresponding to Computer Science articles, with edges based on hyperlinks and 10 classes representing different branches of the field. We use the dataset to evaluate semi-supervised node classification and single-relation link prediction models. Our experiments show that these methods perform well on a new domain, with structural properties different from earlier benchmarks. The dataset is publicly available, along with the implementation of the data pipeline and the benchmark experiments, at https://github.com/pmernyei/wiki-cs-dataset .
    Enabling Fast Deep Learning on Tiny Energy-Harvesting IoT Devices. (arXiv:2111.14051v2 [cs.LG] UPDATED)
    (0 min) Energy harvesting (EH) IoT devices that operate intermittently without batteries, coupled with advances in deep neural networks (DNNs), have opened up new opportunities for enabling sustainable smart applications. Nevertheless, implementing these computation- and memory-intensive intelligent algorithms on EH devices is extremely difficult due to limited resources and an intermittent power supply that causes frequent failures. To address these challenges, this paper proposes a methodology that enables fast deep learning with low-energy accelerators for tiny energy harvesting devices. We first propose $RAD$, a resource-aware structured DNN training framework, which employs block circulant matrices and structured pruning to achieve high compression while leveraging the advantages of various vector operation accelerators. A DNN implementation method, $ACE$, is then proposed that employs low-energy accelerators to achieve maximum performance with small energy consumption. Finally, we design $FLEX$, the system support for intermittent computation in energy harvesting situations. Experimental results from three different DNN models demonstrate that $RAD$, $ACE$, and $FLEX$ enable fast and correct inference on energy harvesting devices with up to a 4.26x runtime reduction and up to a 7.7x energy reduction, with higher accuracy than the state-of-the-art.
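    One reason block-circulant compression (as in $RAD$) suits vector accelerators is that a circulant block is fully described by a single column, and its matrix-vector product reduces to an FFT-domain elementwise multiply. The sketch below checks this equivalence on a made-up 8x8 block.

    ```python
    import numpy as np

    n = 8
    c = np.random.randn(n)   # first column defines the whole circulant block
    x = np.random.randn(n)

    # Dense equivalent for reference: C[i, j] = c[(i - j) % n]
    C = np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

    y_dense = C @ x                                               # O(n^2) multiply
    y_fft = np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))  # O(n log n) via circular convolution
    assert np.allclose(y_dense, y_fft)
    ```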
    Tiny Adversarial Multi-Objective Oneshot Neural Architecture Search. (arXiv:2103.00363v2 [cs.LG] UPDATED)
    (0 min) Due to limited computational cost and energy consumption, most neural network models deployed on mobile devices are tiny. However, tiny neural networks are commonly very vulnerable to attacks. Current research has shown that larger model size can improve robustness, but little research focuses on how to enhance the robustness of tiny neural networks. Our work focuses on improving the robustness of tiny neural networks without seriously deteriorating clean accuracy under mobile-level resources. To this end, we propose a multi-objective oneshot network architecture search (NAS) algorithm to obtain the best trade-off networks in terms of adversarial accuracy, clean accuracy, and model size. Specifically, we design a novel search space based on new tiny blocks and channels to balance model size and adversarial performance. Moreover, since the supernet significantly affects the performance of subnets in our NAS algorithm, we reveal insights into how the supernet helps to obtain the best subnet under white-box adversarial attacks. Concretely, we explore a new adversarial training paradigm by analyzing adversarial transferability, the width of the supernet, and the difference between training the subnets from scratch and fine-tuning. Finally, we make a statistical analysis of the layer-wise combination of certain blocks and channels on the first non-dominated front, which can serve as a guideline for designing tiny neural network architectures that are resilient to adversarial perturbations.
    A Survey on Deep Learning and Explainability for Automatic Report Generation from Medical Images. (arXiv:2010.10563v2 [cs.CV] UPDATED)
    (0 min) Every year physicians face an increasing demand for image-based diagnosis from patients, a problem that can be addressed with recent artificial intelligence methods. In this context, we survey works in the area of automatic report generation from medical images, with emphasis on methods using deep neural networks, with respect to: (1) datasets, (2) architecture design, (3) explainability, and (4) evaluation metrics. Our survey identifies interesting developments but also remaining challenges. Among them, the current evaluation of generated reports is especially weak, since it mostly relies on traditional Natural Language Processing (NLP) metrics, which do not accurately capture medical correctness.
    Effective Sample Size, Dimensionality, and Generalization in Covariate Shift Adaptation. (arXiv:2010.01184v5 [stat.ML] UPDATED)
    (0 min) In supervised learning, training and test datasets are often sampled from distinct distributions, so domain adaptation techniques are required. Covariate shift adaptation yields good generalization performance when domains differ only by the marginal distribution of features. It is usually implemented using importance weighting, which, according to common wisdom, may fail due to small effective sample sizes (ESS). Previous research argues this scenario is more common in high-dimensional settings. However, how effective sample size, dimensionality, and model performance/generalization are formally related in supervised learning, in the context of covariate shift adaptation, remains somewhat obscure in the literature. In this paper, we therefore focus on building a unified view connecting the ESS, data dimensionality, and generalization in the context of covariate shift adaptation. We also demonstrate how dimensionality reduction or feature selection can increase the ESS, and argue that our results support dimensionality reduction before covariate shift adaptation as good practice.
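    For reference, the ESS of an importance-weighted sample in the common Kish formulation is $(\sum_i w_i)^2 / \sum_i w_i^2$; a small ESS signals that a few large weights dominate the covariate shift correction. A minimal sketch:

    ```python
    import numpy as np

    def effective_sample_size(weights):
        # Kish effective sample size: (sum w)^2 / sum(w^2)
        w = np.asarray(weights, dtype=float)
        return w.sum() ** 2 / (w ** 2).sum()

    balanced = np.ones(100)
    skewed = np.r_[np.ones(99), 100.0]
    print(effective_sample_size(balanced))  # 100.0: every sample contributes equally
    print(effective_sample_size(skewed))    # ~3.9: one huge weight dominates
    ```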
    COIN: Counterfactual Image Generation for VQA Interpretation. (arXiv:2201.03342v1 [cs.CV])
    (0 min) Due to significant advances in Natural Language Processing and Computer Vision-based models, Visual Question Answering (VQA) systems are becoming more intelligent and advanced. However, they are still error-prone when dealing with relatively complex questions. Therefore, it is important to understand the behaviour of VQA models before adopting their results. In this paper, we introduce an interpretability approach for VQA models by generating counterfactual images. Specifically, the generated image is supposed to have the minimal possible change to the original image while leading the VQA model to give a different answer. In addition, our approach ensures that the generated image is realistic. Since quantitative metrics cannot be employed to evaluate the interpretability of the model, we carried out a user study to assess different aspects of our approach. In addition to interpreting the results of VQA models on single images, the obtained results and the discussion provide an extensive explanation of VQA models' behaviour.
    Addressing the Lack of Comparability & Testing in CAN Intrusion Detection Research: A Comprehensive Guide to CAN IDS Data & Introduction of the ROAD Dataset. (arXiv:2012.14600v2 [cs.CR] UPDATED)
    (0 min) Although ubiquitous in modern vehicles, Controller Area Networks (CANs) lack basic security properties and are easily exploitable. A rapidly growing field of CAN security research has emerged that seeks to detect intrusions on CANs. Producing vehicular CAN data with a variety of intrusions is out of reach for most researchers as it requires expensive assets and expertise. To assist researchers, we present the first comprehensive guide to the existing open CAN intrusion datasets, including a quality analysis of each dataset and an enumeration of each one's benefits, drawbacks, and suggested use cases. Current public CAN IDS datasets are limited to real fabrication (simple message injection) attacks and simulated attacks, often in synthetic data, which lack fidelity. In general, the physical effects of attacks on the vehicle are not verified in the available datasets. Only one dataset provides signal-translated data but not a corresponding raw binary version. Overall, the available data pigeon-holes CAN IDS works into testing on limited, often inappropriate data (usually with attacks that are too easily detectable to truly test the method), and this lack of data has stymied comparability and reproducibility of results. As our primary contribution, we present the ROAD (Real ORNL Automotive Dynamometer) CAN Intrusion Dataset, consisting of over 3.5 hours of one vehicle's CAN data. ROAD contains ambient data recorded during a diverse set of activities, and attacks of increasing stealth with multiple variants and instances of real fuzzing, fabrication, and unique advanced attacks, as well as simulated masquerade attacks. To facilitate benchmarking of CAN IDS methods that require signal-translated inputs, we also provide the signal time series format for many of the CAN captures. Our contributions aim to facilitate appropriate benchmarking and needed comparability in the CAN IDS field.
    Project Achoo: A Practical Model and Application for COVID-19 Detection from Recordings of Breath, Voice, and Cough. (arXiv:2107.10716v2 [eess.SP] UPDATED)
    (0 min) The COVID-19 pandemic created significant interest in and demand for infection detection and monitoring solutions. In this paper we propose a machine learning method to quickly triage COVID-19 using recordings made on consumer devices. The approach combines signal processing methods with fine-tuned deep learning networks and provides methods for signal denoising, cough detection, and classification. We have also developed and deployed a mobile application that uses a symptom checker together with voice, breath, and cough signals to detect COVID-19 infection. The application showed robust performance on both open-source datasets and the noisy data collected during beta testing by end users.
    BIGPrior: Towards Decoupling Learned Prior Hallucination and Data Fidelity in Image Restoration. (arXiv:2011.01406v3 [cs.CV] UPDATED)
    (0 min) Classic image-restoration algorithms use a variety of priors, either implicitly or explicitly. Their priors are hand-designed and their corresponding weights are heuristically assigned. Hence, deep learning methods often produce superior image restoration quality. Deep networks are, however, capable of inducing strong and hardly predictable hallucinations. Networks implicitly learn to be jointly faithful to the observed data while learning an image prior; and the separation of original data and hallucinated data downstream is then not possible. This limits their widespread adoption in image restoration. Furthermore, it is often the hallucinated part that falls victim to degradation-model overfitting. We present an approach with decoupled network-prior hallucination and data fidelity terms. We refer to our framework as the Bayesian Integration of a Generative Prior (BIGPrior). Our method is rooted in a Bayesian framework and tightly connected to classic restoration methods; in fact, it can be viewed as a generalization of a large family of classic restoration algorithms. We use network inversion to extract image prior information from a generative network. We show that, on image colorization, inpainting, and denoising, our framework consistently improves the inversion results. Our method, though partly reliant on the quality of the generative network inversion, is competitive with state-of-the-art supervised and task-specific restoration methods. It also provides an additional metric that sets forth the degree of prior reliance per pixel relative to data fidelity.
    Applications of Deep Neural Networks. (arXiv:2009.05673v4 [cs.LG] UPDATED)
    (0 min) Deep learning is a group of exciting new technologies for neural networks. Through a combination of advanced training techniques and neural network architectural components, it is now possible to create neural networks that can handle tabular data, images, text, and audio as both input and output. Deep learning allows a neural network to learn hierarchies of information in a way that resembles the function of the human brain. This course introduces the student to classic neural network structures, Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), Generative Adversarial Networks (GAN), and reinforcement learning. Application of these architectures to computer vision, time series, security, natural language processing (NLP), and data generation is covered. High-Performance Computing (HPC) aspects demonstrate how deep learning can be leveraged both on graphical processing units (GPUs) and on grids. The focus is primarily on the application of deep learning to problems, with some introduction to mathematical foundations. Readers will use the Python programming language to implement deep learning using Google TensorFlow and Keras. It is not necessary to know Python prior to this book; however, familiarity with at least one programming language is assumed.
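    In the spirit of the material, here is a minimal Keras example of the kind the book builds on (synthetic tabular data; the actual chapters cover CNNs, LSTMs, GANs, and more):

    ```python
    import numpy as np
    import tensorflow as tf

    # Synthetic binary classification data standing in for a tabular dataset
    X = np.random.randn(1000, 10).astype("float32")
    y = (X.sum(axis=1) > 0).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=5, batch_size=32, verbose=0)
    print(model.evaluate(X, y, verbose=0))   # [loss, accuracy]
    ```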
    Small Object Detection using Deep Learning. (arXiv:2201.03243v1 [cs.CV])
    (0 min) Nowadays, UAVs such as drones are widely used for various purposes, such as image capture and target detection from aerial imagery. Easy public access to these small aerial vehicles can pose serious security threats; for instance, critical places may be monitored by spies blended into the public using drones. The study at hand proposes an improved and efficient deep-learning-based autonomous system that can detect and track very small drones with great precision. The proposed system uses a custom Tiny YOLOv3 model, one of the variants of the very fast object detection model You Only Look Once (YOLO), built and used for detection. The proposed architecture shows significantly better performance than the previous YOLO version, with the improvement observed in terms of resource usage and time complexity. Performance is measured using recall and precision, which are 93% and 91%, respectively.
    Autonomous Kinetic Modeling of Biomass Pyrolysis using Chemical Reaction Neural Networks. (arXiv:2105.11397v2 [physics.chem-ph] UPDATED)
    (0 min) Modeling the burning processes of biomass such as wood, grass, and crops is crucial for the modeling and prediction of wildland and urban fire behavior. Despite its importance, the burning of solid fuels remains poorly understood, which can be partly attributed to the unknown chemical kinetics of most solid fuels. Most available kinetic models were built upon expert knowledge, which requires chemical insights and years of experience. This work presents a framework for autonomously discovering biomass pyrolysis kinetic models from thermogravimetric analyzer (TGA) experimental data using the recently developed chemical reaction neural networks (CRNN). The approach incorporated the CRNN model into the framework of neural ordinary differential equations to predict the residual mass in TGA data. In addition to the flexibility of neural-network-based models, the learned CRNN model is interpretable, by incorporating the fundamental physics laws, such as the law of mass action and Arrhenius law, into the neural network structure. The learned CRNN model can then be translated into the classical forms of biomass chemical kinetic models, which facilitates the extraction of chemical insights and the integration of the kinetic model into large-scale fire simulations. We demonstrated the effectiveness of the framework in predicting the pyrolysis and oxidation of cellulose. This successful demonstration opens the possibility of rapid and autonomous chemical kinetic modeling of solid fuels, such as wildfire fuels and industrial polymers.
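    A hedged sketch of the CRNN idea as described: one exponential-output neuron per reaction, whose input weights are reaction orders and whose bias terms encode Arrhenius kinetics, so $\ln r = \ln A + \beta \ln T - E_a/(RT) + \sum_i \nu_i \ln c_i$. The parameter values below are invented; in the paper they are learned from TGA data via neural ODEs.

    ```python
    import numpy as np

    R = 8.314  # gas constant, J/(mol K)

    def crnn_rate(conc, T, lnA, beta, Ea, nu):
        # Law of mass action + Arrhenius law, expressed as one exponential-output neuron
        ln_rate = lnA + beta * np.log(T) - Ea / (R * T) + nu @ np.log(conc)
        return np.exp(ln_rate)

    conc = np.array([0.8, 0.2])   # species concentrations (e.g., cellulose, char) -- made up
    print(crnn_rate(conc, T=600.0, lnA=18.0, beta=0.0, Ea=1.5e5, nu=np.array([1.0, 0.0])))
    ```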
    Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem. (arXiv:2112.09382v2 [cs.SD] UPDATED)
    (0 min) Deep learning based models have significantly improved the performance of speech separation with cocktail-party-like input mixtures. Prominent methods (e.g., frequency-domain and time-domain speech separation) usually build regression models to predict the ground-truth speech from the mixture, using a masking-based design and a signal-level loss criterion (e.g., MSE or SI-SNR). This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem, with great flexibility and strong potential. Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols, and convert the paradigm of speech separation/enhancement tasks from regression to classification. By utilizing a synthesis model with discrete symbols as input, each target speech signal can be re-synthesized after the prediction of the discrete symbol sequence. Evaluation results based on the WSJ0-2mix and VCTK-noisy corpora in various settings show that our proposed method can steadily synthesize the separated speech with high quality and without the interference that is difficult to avoid in regression-based methods. In addition, with negligible loss of listening quality, the speaker conversion of enhanced/separated speech can easily be realized through our method.
    Wind Park Power Prediction: Attention-Based Graph Networks and Deep Learning to Capture Wake Losses. (arXiv:2201.03229v1 [cs.LG])
    (0 min) With the increased penetration of wind energy into the power grid, it has become increasingly important to predict the expected power production of larger wind farms. Deep learning (DL) models can learn complex patterns in the data and have found wide success in predicting wake losses and expected power production. This paper proposes a modular framework for attention-based graph neural networks (GNN), where attention can be applied to any desired component of a graph block. The results show that the model significantly outperforms a multilayer perceptron (MLP) and a bidirectional LSTM (BLSTM) model, while delivering performance on par with a vanilla GNN model. Moreover, we argue that the proposed graph attention architecture can easily adapt to different applications by offering flexibility in the attention operations used, which may depend on the specific application. Analysis of the attention weights showed that employing attention-based GNNs can provide insights into what the models learn; in particular, the attention networks appeared to capture turbine dependencies that align with physical intuition about wake losses.
    Federated Deep Learning in Electricity Forecasting: An MCDM Approach. (arXiv:2111.13834v3 [math.OC] UPDATED)
    (0 min) Large-scale data analysis is growing at an exponential rate as data proliferate in our societies. This abundance of data has the advantage of allowing the decision-maker to implement complex models in scenarios that were previously prohibitive. At the same time, such an amount of data requires a distributed thinking approach: Deep Learning models require plenty of resources, and distributed training is needed. This paper presents a multicriteria approach for distributed learning. Our approach uses Weighted Goal Programming in its Chebyshev formulation to build an ensemble of decision rules that optimizes a priori defined performance metrics. Such a formulation is beneficial because it is both model and metric agnostic and provides an interpretable output for the decision-maker. We test our approach with a practical application in electricity demand forecasting. Our results suggest that when we allow for dataset split overlapping, the performance of our methodology is consistently above the baseline model trained on the whole dataset.
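    A toy sketch of Weighted Goal Programming in its Chebyshev formulation, the ensembling device described above: minimize the largest weighted deviation $\lambda$ from per-metric goals. The linear metrics, goals, and weights below are invented purely for illustration.

    ```python
    import numpy as np
    from scipy.optimize import linprog

    A = np.array([[3.0, 1.0],        # achievement of metric 1: a_1 . x
                  [1.0, 2.0]])       # achievement of metric 2: a_2 . x
    goals = np.array([2.5, 2.0])     # aspiration levels for each metric
    w = np.array([1.0, 1.5])         # decision-maker priorities

    # Variables z = (x_1, x_2, lambda); objective: minimize lambda
    c = [0.0, 0.0, 1.0]
    # Constraint w_i * (goal_i - a_i . x) <= lambda  <=>  -w_i * a_i . x - lambda <= -w_i * goal_i
    A_ub = np.hstack([-w[:, None] * A, -np.ones((2, 1))])
    b_ub = -w * goals
    # Ensemble weights x lie on the simplex: x_1 + x_2 = 1, x >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=[[1.0, 1.0, 0.0]], b_eq=[1.0],
                  bounds=[(0, None), (0, None), (None, None)])
    print(res.x)   # ensemble weights and the optimal worst-case weighted deviation
    ```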
    Towards Intrinsic Interactive Reinforcement Learning. (arXiv:2112.01575v2 [cs.AI] UPDATED)
    (0 min) Reinforcement learning (RL) and brain-computer interfaces (BCI) are two fields that have grown over the past decade. Until recently, these fields operated independently of one another. With the rising interest in human-in-the-loop (HITL) applications, RL algorithms have been adapted to account for human guidance, giving rise to the sub-field of interactive reinforcement learning (IRL). Adjacently, BCI applications have long been interested in extracting intrinsic feedback from neural activity during human-computer interactions. These two ideas have set RL and BCI on a collision course with one another through the integration of BCI into the IRL framework, where intrinsic feedback can be utilized to help train an agent. This intersection has created a new and emerging paradigm denoted intrinsic IRL. To further facilitate deeper integration of BCI and IRL, we provide a tutorial and review of intrinsic IRL so far, with an emphasis on its parent field of feedback-driven IRL, along with discussions concerning validity, challenges, and open problems.
    Invariance encoding in sliced-Wasserstein space for image classification with limited training data. (arXiv:2201.02980v1 [cs.CV])
    (0 min) Deep convolutional neural networks (CNNs) are broadly considered to be state-of-the-art generic end-to-end image classification systems. However, they are known to underperform when training data are limited and thus require data augmentation strategies that render the method computationally expensive and not always effective. Rather than using a data augmentation strategy to encode invariances as typically done in machine learning, here we propose to mathematically augment a nearest subspace classification model in sliced-Wasserstein space by exploiting certain mathematical properties of the Radon Cumulative Distribution Transform (R-CDT), a recently introduced image transform. We demonstrate that for a particular type of learning problem, our mathematical solution has advantages over data augmentation with deep CNNs in terms of classification accuracy and computational complexity, and is particularly effective under a limited training data setting. The method is simple, effective, computationally efficient, non-iterative, and requires no parameters to be tuned. Python code implementing our method is available at https://github.com/rohdelab/mathematical_augmentation. Our method is integrated as a part of the software package PyTransKit, which is available at https://github.com/rohdelab/PyTransKit.
    Development of a hybrid machine-learning and optimization tool for performance-based solar shading design. (arXiv:2201.03028v1 [cs.LG])
    (0 min) Solar shading design should be carried out with the desired Indoor Environmental Quality (IEQ) in mind in the early design stages. This task can be very challenging and time-consuming, and it also requires experts, sophisticated software, and a large amount of money. The primary purpose of this research is to design a simple tool for studying various models of solar shading and making decisions easier and faster in the early stages. Database generation methods, artificial intelligence, and optimization have been used to achieve this goal. This tool includes two main parts: (1) predicting the performance of a user-selected model along with proposing effective parameters, and (2) proposing optimal pre-prepared models to the user. Initially, a side-lit shoebox model with variable parameters was modeled parametrically, and five common solar shading models with their variables were applied to the space. For each solar shading model and for the state without shading, metrics related to daylight and glare, view, and initial costs were simulated. The database generated in this research includes 87,912 alternatives with six calculated metrics, which were introduced to optimized machine learning models, including a neural network, random forest, support vector regression, and k-nearest neighbors. According to the results, the most accurate and fastest estimation model was the random forest, with an R2 score of 0.967 to 1. Then, sensitivity analysis was performed to identify the most influential parameters for each shading model and for the state without it. This analysis distinguished the most effective parameters, including window orientation, WWR, room width, length, and shading depth. Finally, by optimizing the estimation function of the machine learning models with the NSGA-II algorithm, about 7,300 optimal models were identified. The developed tool can evaluate each design alternative in less than a few seconds.
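    A small sketch of the model-selection step described above, grid search with 10-fold cross-validation over a random forest surrogate; the features and target are synthetic stand-ins for the shading-design database.

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    X = rng.random((500, 6))                                   # e.g., orientation, WWR, room dims, shading depth
    y = X @ rng.random(6) + 0.1 * rng.standard_normal(500)     # e.g., a daylight metric

    grid = GridSearchCV(
        RandomForestRegressor(random_state=0),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
        cv=10, scoring="r2",
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)
    ```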
    Task2Sim : Towards Effective Pre-training and Transfer from Synthetic Data. (arXiv:2112.00054v2 [cs.CV] UPDATED)
    (0 min) Pre-training models on ImageNet or other massive datasets of real images has led to major advances in computer vision, albeit accompanied by shortcomings related to curation cost, privacy, usage rights, and ethical issues. In this paper, for the first time, we study the transferability of pre-trained models based on synthetic data generated by graphics simulators to downstream tasks from very different domains. In using such synthetic data for pre-training, we find that downstream performance on different tasks is favored by different configurations of simulation parameters (e.g., lighting, object pose, backgrounds, etc.), and that there is no one-size-fits-all solution. It is thus better to tailor synthetic pre-training data to a specific downstream task for best performance. We introduce Task2Sim, a unified model mapping downstream task representations to optimal simulation parameters to generate synthetic pre-training data for them. Task2Sim learns this mapping by training to find the set of best parameters on a set of "seen" tasks. Once trained, it can then be used to predict the best simulation parameters for novel "unseen" tasks in one shot, without requiring additional training. Given a budget in number of images per class, our extensive experiments with 20 diverse downstream tasks show that Task2Sim's task-adaptive pre-training data results in significantly better downstream performance than non-adaptively chosen simulation parameters on both seen and unseen tasks. It is even competitive with pre-training on real images from ImageNet.
    Lung infection and normal region segmentation from CT volumes of COVID-19 cases. (arXiv:2201.03050v1 [eess.IV])
    (0 min) This paper proposes an automated method for segmenting infection and normal regions in the lung from CT volumes of COVID-19 patients. Since December 2019, novel coronavirus disease 2019 (COVID-19) has spread across the world, significantly impacting our economic activities and daily lives. To diagnose the large number of infected patients, diagnostic assistance by computers is needed; chest CT is effective for the diagnosis of viral pneumonia, including COVID-19, and a quantitative, computer-based analysis of lung condition from CT volumes is required. In the diagnosis of lung diseases including COVID-19, analysis of the condition of normal and infection regions in the lung is important. Our method recognizes and segments normal and infection regions in the lung in CT volumes using a COVID-19 segmentation fully convolutional network (FCN). To segment infection regions that have various shapes and sizes, we introduce dense pooling connections and dilated convolutions in our FCN. We applied the proposed method to CT volumes of COVID-19 cases. From mild to severe cases of COVID-19, the proposed method correctly segmented normal and infection regions in the lung. Dice scores for normal and infection regions were 0.911 and 0.753, respectively.
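    For reference, the Dice score reported above is, for binary masks $A$ and $B$, $2|A \cap B| / (|A| + |B|)$:

    ```python
    import numpy as np

    def dice(pred, target, eps=1e-8):
        # Dice score for binary segmentation masks
        pred, target = pred.astype(bool), target.astype(bool)
        inter = np.logical_and(pred, target).sum()
        return 2.0 * inter / (pred.sum() + target.sum() + eps)

    a = np.zeros((4, 4)); a[:2] = 1   # toy prediction: top two rows
    b = np.zeros((4, 4)); b[1:3] = 1  # toy ground truth: middle two rows
    print(dice(a, b))                 # 0.5: the masks overlap on one of two rows each
    ```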
    The Logic of Graph Neural Networks. (arXiv:2104.14624v2 [cs.LG] UPDATED)
    (0 min) Graph neural networks (GNNs) are deep learning architectures for machine learning problems on graphs. It has recently been shown that the expressiveness of GNNs can be characterised precisely by the combinatorial Weisfeiler-Leman algorithms and by finite variable counting logics. The correspondence has even led to new, higher-order GNNs corresponding to the WL algorithm in higher dimensions. The purpose of this paper is to explain these descriptive characterisations of GNNs.
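    A compact sketch of the 1-dimensional Weisfeiler-Leman (colour refinement) procedure that characterises the expressiveness of standard message-passing GNNs: in each round, a node's colour is rehashed together with the multiset of its neighbours' colours.

    ```python
    def wl_refine(adj, rounds=3):
        # adj: dict mapping each vertex to a list of its neighbours
        colors = {v: 0 for v in adj}                      # uniform initial colouring
        for _ in range(rounds):
            signatures = {
                v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj
            }
            # Relabel signatures with small integers (canonical palette)
            palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
            colors = {v: palette[signatures[v]] for v in adj}
        return colors

    # On a 3-vertex path, refinement separates the middle vertex from the endpoints:
    path = {0: [1], 1: [0, 2], 2: [1]}
    print(wl_refine(path))   # endpoints share a colour, middle node differs
    ```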
    Differentiable and Scalable Generative Adversarial Models for Data Imputation. (arXiv:2201.03202v1 [cs.LG])
    (0 min) Data imputation has been extensively explored to solve the missing data problem. The dramatically increasing volume of incomplete data makes imputation models computationally infeasible in many real-life applications. In this paper, we propose an effective scalable imputation system named SCIS to significantly speed up the training of differentiable generative adversarial imputation models under accuracy guarantees for large-scale incomplete data. SCIS consists of two modules: differentiable imputation modeling (DIM) and sample size estimation (SSE). DIM leverages a new masking Sinkhorn divergence function to make an arbitrary generative adversarial imputation model differentiable, while for such a differentiable imputation model, SSE can estimate an appropriate sample size to ensure the user-specified imputation accuracy of the final model. Extensive experiments on several real-life large-scale datasets demonstrate that our proposed system can accelerate generative adversarial model training by 7.1x. Using around 7.6% of the samples, SCIS yields accuracy competitive with the state-of-the-art imputation methods in a much shorter computation time.
    A Study on Mitigating Hard Boundaries of Decision-Tree-based Uncertainty Estimates for AI Models. (arXiv:2201.03263v1 [cs.LG])
    (0 min) Outcomes of data-driven AI models cannot be assumed to be always correct. To estimate the uncertainty in these outcomes, the uncertainty wrapper framework has been proposed, which considers uncertainties related to model fit, input quality, and scope compliance. Uncertainty wrappers use a decision tree approach to cluster input-quality-related uncertainties, assigning inputs strictly to distinct uncertainty clusters. Hence, a slight variation in only one feature may lead to a cluster assignment with a significantly different uncertainty. Our objective is to replace this with an approach that mitigates the hard decision boundaries of these assignments while preserving interpretability, runtime complexity, and prediction performance. Five approaches were selected as candidates and integrated into the uncertainty wrapper framework. For the evaluation based on the Brier score, datasets for a pedestrian detection use case were generated using the CARLA simulator and YOLOv3. All integrated approaches achieved a softening, i.e., smoothing, of uncertainty estimation. Yet, compared to decision trees, they are not as easy to interpret and have higher runtime complexity. Moreover, some components of the Brier score deteriorated while others improved. Most promising with regard to the Brier score were random forests. In conclusion, softening hard decision tree boundaries appears to be a trade-off decision.
    AutoDEUQ: Automated Deep Ensemble with Uncertainty Quantification. (arXiv:2110.13511v2 [cs.LG] UPDATED)
    (0 min) Deep neural networks are powerful predictors for a variety of tasks. However, they do not capture uncertainty directly. Using neural network ensembles to quantify uncertainty is competitive with approaches based on Bayesian neural networks while benefiting from better computational scalability. However, building ensembles of neural networks is a challenging task because, in addition to choosing the right neural architecture or hyperparameters for each member of the ensemble, there is an added cost of training each model. We propose AutoDEUQ, an automated approach for generating an ensemble of deep neural networks. Our approach leverages joint neural architecture and hyperparameter search to generate ensembles. We use the law of total variance to decompose the predictive variance of deep ensembles into aleatoric (data) and epistemic (model) uncertainties. We show that AutoDEUQ outperforms probabilistic backpropagation, Monte Carlo dropout, deep ensemble, distribution-free ensembles, and hyper ensemble methods on a number of regression benchmarks.
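    The decomposition AutoDEUQ relies on is the law of total variance: with ensemble members each predicting a mean $\mu_k$ and variance $\sigma_k^2$, the predictive variance splits into an aleatoric part (the mean of the variances) and an epistemic part (the variance of the means). The numbers below are made up:

    ```python
    import numpy as np

    mu = np.array([1.9, 2.1, 2.0, 2.2])          # per-member predicted means (illustrative)
    sigma2 = np.array([0.30, 0.25, 0.35, 0.28])  # per-member predicted variances (illustrative)

    aleatoric = sigma2.mean()   # E_k[sigma2_k]: irreducible data noise
    epistemic = mu.var()        # Var_k[mu_k]: disagreement between ensemble members
    total = aleatoric + epistemic
    print(aleatoric, epistemic, total)
    ```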
    Non-Asymptotic Guarantees for Robust Statistical Learning under $(1+\varepsilon)$-th Moment Assumption. (arXiv:2201.03182v1 [stat.ML])
    (0 min) There has been a surge of interest in developing robust estimators for models with heavy-tailed data in statistics and machine learning. This paper proposes a log-truncated M-estimator for a large family of statistical regressions and establishes its excess risk bound under the condition that the data have a $(1+\varepsilon)$-th moment with $\varepsilon \in (0,1]$. With an additional assumption on the associated risk function, we obtain an $\ell_2$-error bound for the estimation. Our theorems are applied to establish robust M-estimators for concrete regressions. Besides convex regressions such as quantile regression and generalized linear models, many non-convex regressions can also be fitted into our theorems; we focus on robust deep neural network regressions, which can be solved by stochastic gradient descent algorithms. Simulations and real data analysis demonstrate the superiority of log-truncated estimations over standard estimations.
    Loss-calibrated expectation propagation for approximate Bayesian decision-making. (arXiv:2201.03128v1 [stat.ML])
    (0 min) Approximate Bayesian inference methods provide a powerful suite of tools for finding approximations to intractable posterior distributions. However, machine learning applications typically involve selecting actions, which -- in a Bayesian setting -- depend on the posterior distribution only via its contribution to expected utility. A growing body of work on loss-calibrated approximate inference methods has therefore sought to develop posterior approximations sensitive to the influence of the utility function. Here we introduce loss-calibrated expectation propagation (Loss-EP), a loss-calibrated variant of expectation propagation. This method resembles standard EP with an additional factor that "tilts" the posterior towards higher-utility decisions. We show applications to Gaussian process classification under binary utility functions with asymmetric penalties on False Negative and False Positive errors, and show how this asymmetry can have dramatic consequences on what information is "useful" to capture in an approximation.
    Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models. (arXiv:2111.02840v2 [cs.CL] UPDATED)
    (0 min) Large-scale pre-trained language models have achieved tremendous success across a wide range of natural language understanding (NLU) tasks, even surpassing human performance. However, recent studies reveal that the robustness of these models can be challenged by carefully crafted textual adversarial examples. While several individual datasets have been proposed to evaluate model robustness, a principled and comprehensive benchmark is still missing. In this paper, we present Adversarial GLUE (AdvGLUE), a new multi-task benchmark to quantitatively and thoroughly explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. In particular, we systematically apply 14 textual adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. Our findings are summarized as follows. (i) Most existing adversarial attack algorithms are prone to generating invalid or ambiguous adversarial examples, with around 90% of them either changing the original semantic meaning or misleading human annotators. Therefore, we perform a careful filtering process to curate a high-quality benchmark. (ii) All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy. We hope our work will motivate the development of new adversarial attacks that are more stealthy and semantic-preserving, as well as new robust language models against sophisticated adversarial attacks. AdvGLUE is available at https://adversarialglue.github.io.
    Action Transformer: A Self-Attention Model for Short-Time Pose-Based Human Action Recognition. (arXiv:2107.00606v6 [cs.CV] UPDATED)
    (0 min) Deep neural networks based purely on attention have been successful across several domains, relying on minimal architectural priors from the designer. In Human Action Recognition (HAR), attention mechanisms have been primarily adopted on top of standard convolutional or recurrent layers, improving the overall generalization capability. In this work, we introduce Action Transformer (AcT), a simple, fully self-attentional architecture that consistently outperforms more elaborated networks that mix convolutional, recurrent and attentive layers. In order to limit computational and energy requests, building on previous human action recognition research, the proposed approach exploits 2D pose representations over small temporal windows, providing a low latency solution for accurate and effective real-time performance. Moreover, we open-source MPOSE2021, a new large-scale dataset, as an attempt to build a formal training and evaluation benchmark for real-time, short-time HAR. The proposed methodology was extensively tested on MPOSE2021 and compared to several state-of-the-art architectures, proving the effectiveness of the AcT model and laying the foundations for future work on HAR.
    Improving Fairness for Data Valuation in Federated Learning. (arXiv:2109.09046v2 [cs.LG] UPDATED)
    (0 min) Federated learning is an emerging decentralized machine learning scheme that allows multiple data owners to work collaboratively while ensuring data privacy. The success of federated learning depends largely on the participation of data owners. To sustain and encourage data owners' participation, it is crucial to fairly evaluate the quality of the data provided by the data owners and reward them correspondingly. Federated Shapley value, recently proposed by Wang et al. [Federated Learning, 2020], is a measure for data value under the framework of federated learning that satisfies many desired properties for data valuation. However, there are still factors of potential unfairness in the design of federated Shapley value because two data owners with the same local data may not receive the same evaluation. We propose a new measure called completed federated Shapley value to improve the fairness of federated Shapley value. The design depends on completing a matrix consisting of all the possible contributions by different subsets of the data owners. It is shown under mild conditions that this matrix is approximately low-rank by leveraging concepts and tools from optimization. Both theoretical analysis and empirical evaluation verify that the proposed measure does improve fairness in many circumstances.
    A Survey on Face Recognition Systems. (arXiv:2201.02991v1 [cs.CV])
    (0 min) Face recognition has proven to be one of the most successful technologies and has impacted heterogeneous domains. Deep learning has proven to be the most successful approach to computer vision tasks because of its convolution-based architecture. Since the advent of deep learning, face recognition technology has seen a substantial increase in its accuracy. In this paper, some of the most impactful face recognition systems are surveyed. Firstly, the paper gives an overview of a general face recognition system. Secondly, the survey covers various network architectures and training losses that have had a substantial impact. Finally, the paper discusses various databases that are used to evaluate the capabilities of a face recognition system.
    Video-based fully automatic assessment of open surgery suturing skills. (arXiv:2110.13972v2 [cs.CV] UPDATED)
    (0 min) The goal of this study was to develop a new, reliable open surgery suturing simulation system for training medical students in situations where resources are limited or in a home setting. Namely, we developed an algorithm for tool and hand localization, as well as for identifying the interactions between them, based on simple webcam video data, calculating motion metrics for the assessment of surgical skill. Twenty-five participants performed multiple suturing tasks using our simulator. The YOLO network was modified into a multi-task network for the purposes of tool localization and tool-hand interaction detection. This was accomplished by splitting the YOLO detection heads so that they supported both tasks with minimal addition to computer run-time. Furthermore, based on the outcome of the system, motion metrics were calculated. These metrics included traditional metrics such as time and path length, as well as new metrics assessing the technique participants use for holding the tools. The dual-task network's performance was similar to that of two separate networks, while its computational load was only slightly larger than that of a single network. In addition, the motion metrics showed significant differences between experts and novices. While video capture is an essential part of minimally invasive surgery, it is not an integral component of open surgery; thus, new algorithms focusing on the unique challenges that open surgery videos present are required. In this study, a dual-task network was developed to solve both a localization task and a hand-tool interaction task. The dual network may easily be expanded to a multi-task network, which may be useful for images with multiple layers and for evaluating the interaction between these different layers.
    Preserving Domain Private Representation via Mutual Information Maximization. (arXiv:2201.03102v1 [cs.LG])
    (0 min) Recent advances in unsupervised domain adaptation have shown that mitigating the domain divergence by extracting the domain-invariant representation can significantly improve the generalization of a model to an unlabeled data domain. Nevertheless, existing methods fail to effectively preserve the representation that is private to the label-missing domain, which can adversely affect generalization. In this paper, we propose an approach to preserve such representation so that the latent distribution of the unlabeled domain can represent both the domain-invariant features and the individual characteristics that are private to the unlabeled domain. In particular, we demonstrate that maximizing the mutual information between the unlabeled domain and its latent space while mitigating the domain divergence can achieve such preservation. We also theoretically and empirically validate that preserving the representation that is private to the unlabeled domain is important and necessary for cross-domain generalization. Our approach outperforms state-of-the-art methods on several public datasets.
    Predictions of Reynolds and Nusselt numbers in turbulent convection using machine-learning models. (arXiv:2201.03200v1 [physics.flu-dyn])
    (0 min) In this paper, we develop a multivariate regression model and a neural network model to predict the Reynolds number (Re) and Nusselt number in turbulent thermal convection. We compare their predictions with those of earlier models of convection: the Grossmann-Lohse [Phys. Rev. Lett. 86, 3316 (2001)], revised Grossmann-Lohse [Phys. Fluids 33, 015113 (2021)], and Pandey-Verma [Phys. Rev. E 94, 053106 (2016)] models. We observe that although the predictions of all the models are quite close to each other, the machine learning models developed in this work provide the best match with the experimental and numerical results.
    Privacy-aware Early Detection of COVID-19 through Adversarial Training. (arXiv:2201.03004v1 [cs.LG])
    (0 min) Early detection of COVID-19 is an ongoing area of research that can help with triage, monitoring, and general health assessment of potential patients, and may reduce operational strain on hospitals coping with the coronavirus pandemic. Different machine learning techniques have been used in the literature to detect coronavirus using routine clinical data (blood tests and vital signs). Data breaches and information leakage when using these models can bring reputational damage and cause legal issues for hospitals. In spite of this, protecting healthcare models against leakage of potentially sensitive information is an understudied research area. In this work, we examine two machine learning approaches intended to predict a patient's COVID-19 status using routinely collected and readily available clinical data. We employ adversarial training to explore robust deep learning architectures that protect attributes related to demographic information about the patients. The two models we examine in this work are intended to preserve sensitive information against adversarial attacks and information leakage. In a series of experiments using datasets from the Oxford University Hospitals, Bedfordshire Hospitals NHS Foundation Trust, University Hospitals Birmingham NHS Foundation Trust, and Portsmouth Hospitals University NHS Trust, we train and test two neural networks that predict PCR test results using information from basic laboratory blood tests and vital signs collected on a patient's arrival to hospital. We assess the level of privacy each of the models can provide and show the efficacy and robustness of our proposed architectures against a comparable baseline. One of our main contributions is that we specifically target the development of effective COVID-19 detection models with built-in mechanisms to selectively protect sensitive attributes against adversarial attacks.
    Reframing Neural Networks: Deep Structure in Overcomplete Representations. (arXiv:2103.05804v2 [cs.LG] UPDATED)
    (0 min) In comparison to classical shallow representation learning techniques, deep neural networks have achieved superior performance in nearly every application benchmark. But despite their clear empirical advantages, it is still not well understood what makes them so effective. To approach this question, we introduce deep frame approximation: a unifying framework for constrained representation learning with structured overcomplete frames. While exact inference requires iterative optimization, it may be approximated by the operations of a feed-forward deep neural network. We indirectly analyze how model capacity relates to frame structures induced by architectural hyperparameters such as depth, width, and skip connections. We quantify these structural differences with the deep frame potential, a data-independent measure of coherence linked to representation uniqueness and stability. As a criterion for model selection, we show correlation with generalization error on a variety of common deep network architectures and datasets. We also demonstrate how recurrent networks implementing iterative optimization algorithms can achieve performance comparable to their feed-forward approximations while improving adversarial robustness. This connection to the established theory of overcomplete representations suggests promising new directions for principled deep network architecture design with less reliance on ad-hoc engineering.
    Learning Meta Representations for Agents in Multi-Agent Reinforcement Learning. (arXiv:2108.12988v2 [cs.LG] UPDATED)
    (0 min) In multi-agent reinforcement learning, the behaviors that agents learn in a single Markov Game (MG) are typically confined to the given agent number (i.e., population size). Every single MG induced by varying population sizes may possess distinct optimal joint strategies and game-specific knowledge, which are modeled independently in modern multi-agent algorithms. In this work, we focus on creating agents that generalize across population-varying MGs. Instead of learning a unimodal policy, each agent learns a policy set that is formed by effective strategies across a variety of games. We propose Meta Representations for Agents (MRA) that explicitly models the game-common and game-specific strategic knowledge. By representing the policy sets with multi-modal latent policies, the common strategic knowledge and diverse strategic modes are discovered with an iterative optimization procedure. We prove that as an approximation to a constrained mutual information maximization objective, the learned policies can reach Nash Equilibrium in every evaluation MG under the assumption of Lipschitz game on a sufficiently large latent space. When deploying it at practical latent models with limited size, fast adaptation can be achieved by leveraging the first-order gradient information. Extensive experiments show the effectiveness of MRA on both training performance and generalization ability in hard and unseen games.
    Integration of Explainable Artificial Intelligence to Identify Significant Landslide Causal Factors for Extreme Gradient Boosting based Landslide Susceptibility Mapping with Improved Feature Selection. (arXiv:2201.03225v1 [cs.LG])
    (0 min) Landslides have been a regular occurrence and an alarming threat to human life and property in the era of anthropogenic global warming. Early prediction of landslide susceptibility using a data-driven approach is a pressing need. In this study, we explored the features that best describe landslide susceptibility using state-of-the-art machine learning methods. We employed algorithms including XGBoost, logistic regression (LR), k-nearest neighbors (KNN), SVM, and AdaBoost for landslide susceptibility prediction. To find the best hyperparameters of each individual classifier for optimized performance, we incorporated grid search with 10-fold cross-validation. In this context, the optimized version of XGBoost outperformed all other classifiers with a cross-validation weighted F1 score of 94.62%. Following this empirical evidence, we explored the XGBoost classifier by incorporating TreeSHAP and identified influential features such as SLOPE, ELEVATION, and TWI that contribute most to the classifier's performance, and features such as LANDUSE, NDVI, and SPI that have less effect on the model's performance. According to the TreeSHAP explanation of features, we selected the 9 most significant landslide causal factors out of 15. Evidently, an optimized version of XGBoost, along with a 40% feature reduction, outperformed all other classifiers in terms of popular evaluation metrics, with a cross-validation weighted F1 score of 95.01% on the training set and an AUC score of 97%.
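    A sketch of the explanation step described above, assuming the xgboost and shap packages are available: fit an XGBoost classifier and rank factors by mean absolute TreeSHAP value. The feature names and data below are synthetic placeholders for the landslide causal factors.

    ```python
    import numpy as np
    import shap
    import xgboost

    rng = np.random.default_rng(0)
    names = ["SLOPE", "ELEVATION", "TWI", "LANDUSE", "NDVI", "SPI"]
    X = rng.random((400, len(names)))
    y = (X[:, 0] + 0.5 * X[:, 1] > 1.0).astype(int)   # toy susceptibility driven by slope/elevation

    model = xgboost.XGBClassifier(n_estimators=100).fit(X, y)
    shap_values = shap.TreeExplainer(model).shap_values(X)

    # Rank features by mean absolute SHAP contribution
    importance = np.abs(shap_values).mean(axis=0)
    for name, imp in sorted(zip(names, importance), key=lambda p: -p[1]):
        print(f"{name}: {imp:.3f}")
    ```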
    Robust and Resource-Efficient Data-Free Knowledge Distillation by Generative Pseudo Replay. (arXiv:2201.03019v1 [cs.LG])
    (0 min) Data-Free Knowledge Distillation (KD) allows knowledge transfer from a trained neural network (teacher) to a more compact one (student) in the absence of original training data. Existing works use a validation set to monitor the accuracy of the student over real data and report the highest performance throughout the entire process. However, validation data may not be available at distillation time either, making it infeasible to record the student snapshot that achieved the peak accuracy. Therefore, a practical data-free KD method should be robust and ideally provide monotonically increasing student accuracy during distillation. This is challenging because the student experiences knowledge degradation due to the distribution shift of the synthetic data. A straightforward approach to overcome this issue is to store and rehearse the generated samples periodically, which increases the memory footprint and creates privacy concerns. We propose to model the distribution of the previously observed synthetic samples with a generative network. In particular, we design a Variational Autoencoder (VAE) with a training objective that is customized to learn the synthetic data representations optimally. The student is rehearsed by the generative pseudo replay technique, with samples produced by the VAE. Hence knowledge degradation can be prevented without storing any samples. Experiments on image classification benchmarks show that our method optimizes the expected value of the distilled model accuracy while eliminating the large memory overhead incurred by the sample-storing methods.
    HAWKS: Evolving Challenging Benchmark Sets for Cluster Analysis. (arXiv:2102.06940v2 [cs.NE] UPDATED)
    (0 min) Comprehensive benchmarking of clustering algorithms is rendered difficult by two key factors: (i) the elusiveness of a unique mathematical definition of this unsupervised learning approach and (ii) dependencies between the generating models or clustering criteria adopted by some clustering algorithms and indices for internal cluster validation. Consequently, there is no consensus regarding the best practice for rigorous benchmarking, and whether this is possible at all outside the context of a given application. Here, we argue that synthetic datasets must continue to play an important role in the evaluation of clustering algorithms, but that this necessitates constructing benchmarks that appropriately cover the diverse set of properties that impact clustering algorithm performance. Through our framework, HAWKS, we demonstrate the important role evolutionary algorithms play to support flexible generation of such benchmarks, allowing simple modification and extension. We illustrate two possible uses of our framework: (i) the evolution of benchmark data consistent with a set of hand-derived properties and (ii) the generation of datasets that tease out performance differences between a given pair of algorithms. Our work has implications for the design of clustering benchmarks that sufficiently challenge a broad range of algorithms, and for furthering insight into the strengths and weaknesses of specific approaches.
    Noisy Neonatal Chest Sound Separation for High-Quality Heart and Lung Sounds. (arXiv:2201.03211v1 [eess.AS])
    (0 min) Stethoscope-recorded chest sounds provide the opportunity for remote cardio-respiratory health monitoring of neonates. However, reliable monitoring requires high-quality heart and lung sounds. This paper presents novel Non-negative Matrix Factorisation (NMF) and Non-negative Matrix Co-Factorisation (NMCF) methods for neonatal chest sound separation. To assess these methods and compare them with existing single-source separation methods, an artificial mixture dataset was generated comprising heart, lung, and noise sounds. Signal-to-noise ratios were then calculated for these artificial mixtures. The methods were also tested on real-world noisy neonatal chest sounds and assessed based on vital sign estimation error and a 1-5 signal quality score developed in our previous works. Additionally, the computational cost of all methods was assessed to determine their applicability for real-time processing. Overall, both the proposed NMF and NMCF methods outperform the next best existing method by 2.7dB to 11.6dB on the artificial dataset and by 0.40 to 1.12 in signal quality on the real-world dataset. The median processing time for the sound separation of a 10s recording was found to be 28.3s for NMCF and 342ms for NMF. Because of their stable and robust performance, we believe the proposed methods are useful for denoising neonatal heart and lung sounds in real-world environments. Code for the proposed and existing methods can be found at: https://github.com/egrooby-monash/Heart-and-Lung-Sound-Separation.
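    A generic single-channel NMF separation baseline of the kind the paper builds on can be sketched as follows; the component-to-source assignment, sample rate, and STFT settings are assumptions for illustration, not the authors' NMF/NMCF formulation.

        # Minimal sketch of NMF-based source separation on a magnitude spectrogram.
        # A generic single-channel baseline, not the authors' NMF/NMCF variants.
        import numpy as np
        from scipy.signal import stft, istft
        from sklearn.decomposition import NMF

        fs = 4000
        x = np.random.randn(fs * 10)                  # placeholder for a 10 s chest recording

        f, t, Z = stft(x, fs=fs, nperseg=512)
        mag, phase = np.abs(Z), np.angle(Z)

        # Factorise the spectrogram; assume the first 5 bases capture the heart source
        model = NMF(n_components=10, init="nndsvda", max_iter=500)
        W = model.fit_transform(mag)                  # (freq, components)
        H = model.components_                         # (components, time)

        heart_mag = W[:, :5] @ H[:5, :]
        mask = heart_mag / (W @ H + 1e-8)             # soft mask for the heart source
        _, heart = istft(mask * mag * np.exp(1j * phase), fs=fs, nperseg=512)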
    Online Deterministic Annealing for Classification and Clustering. (arXiv:2102.05836v4 [cs.LG] UPDATED)
    (0 min) Inherent in virtually every iterative machine learning algorithm is the problem of hyper-parameter tuning, which includes three major design parameters: (a) the complexity of the model, e.g., the number of neurons in a neural network, (b) the initial conditions, which heavily affect the behavior of the algorithm, and (c) the dissimilarity measure used to quantify its performance. We introduce an online prototype-based learning algorithm that can be viewed as a progressively growing competitive-learning neural network architecture for classification and clustering. The learning rule of the proposed approach is formulated as an online gradient-free stochastic approximation algorithm that solves a sequence of appropriately defined optimization problems, simulating an annealing process. The annealing nature of the algorithm contributes to avoiding poor local minima, offers robustness with respect to the initial conditions, and provides a means to progressively increase the complexity of the learning model, through an intuitive bifurcation phenomenon. The proposed approach is interpretable, requires minimal hyper-parameter tuning, and allows online control over the performance-complexity trade-off. Finally, we show that Bregman divergences appear naturally as a family of dissimilarity measures that play a central role in both the performance and the computational complexity of the learning algorithm.
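    The flavor of such an annealed, gradient-free prototype update can be conveyed with a toy loop: soft Gibbs assignments at temperature T pull prototypes toward streaming samples while T is slowly lowered. This is a simplified illustration, not the paper's full stochastic-approximation algorithm with bifurcation-driven model growth.

        # Toy sketch of online prototype learning with an annealed soft assignment.
        import numpy as np

        rng = np.random.default_rng(0)
        prototypes = rng.normal(size=(4, 2))          # K=4 prototypes in 2-D
        T, lr = 5.0, 0.05                             # initial temperature and step size

        for step in range(10_000):
            x = rng.normal(size=2)                    # one streaming observation
            d2 = ((prototypes - x) ** 2).sum(axis=1)  # squared Euclidean dissimilarity
            p = np.exp(-d2 / T)
            p /= p.sum()                              # Gibbs association probabilities
            prototypes += lr * p[:, None] * (x - prototypes)
            T = max(0.01, T * 0.999)                  # geometric annealing schedule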
    Information-Theoretic Bias Reduction via Causal View of Spurious Correlation. (arXiv:2201.03121v1 [cs.LG])
    (0 min) We propose an information-theoretic bias measurement technique based on a causal interpretation of spurious correlation, which is effective in identifying feature-level algorithmic bias by taking advantage of conditional mutual information. Although several bias measurement methods have been proposed and widely investigated to achieve algorithmic fairness in various tasks such as face recognition, their accuracy- or logit-based metrics are prone to producing trivial prediction score adjustments rather than fundamental bias reduction. Hence, we design a novel debiasing framework against algorithmic bias, which incorporates a bias regularization loss derived from the proposed information-theoretic bias measurement approach. In addition, we present a simple yet effective unsupervised debiasing technique based on stochastic label noise, which does not require explicit supervision of bias information. The proposed bias measurement and debiasing approaches are validated in diverse realistic scenarios through extensive experiments on multiple standard benchmarks.
    Graph Representation Learning for Multi-Task Settings: a Meta-Learning Approach. (arXiv:2201.03326v1 [cs.LG])
    (0 min) Graph Neural Networks (GNNs) have become the state-of-the-art method for many applications on graph-structured data. GNNs are a framework for graph representation learning, where a model learns to generate low-dimensional node embeddings that encapsulate structural and feature-related information. GNNs are usually trained in an end-to-end fashion, leading to highly specialized node embeddings. While this approach achieves great results in the single-task setting, generating node embeddings that can be used to perform multiple tasks (with performance comparable to single-task models) is still an open problem. We propose a novel training strategy for graph representation learning, based on meta-learning, which allows the training of a GNN model capable of producing multi-task node embeddings. Our method avoids the difficulties arising when learning to perform multiple tasks concurrently by, instead, learning to quickly (i.e. with a few steps of gradient descent) adapt to each task individually. We show that the embeddings produced by a model trained with our method can be used to perform multiple tasks with comparable or, surprisingly, even higher performance than both single-task and multi-task end-to-end models.
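    A minimal sketch of this "learn to adapt quickly" idea, in the style of first-order meta-learning (Reptile), is shown below; a linear head on random placeholder data stands in for the GNN and its tasks, and all shapes and step sizes are illustrative assumptions.

        # Reptile-style first-order meta-learning sketch; the linear head is a stand-in
        # for a GNN, and the random task data are placeholders.
        import torch

        emb_dim, n_tasks = 16, 8
        meta_w = torch.zeros(emb_dim)                 # meta-parameters

        def task_loss(w, X, y):
            return ((X @ w - y) ** 2).mean()

        for _ in range(100):                          # meta-iterations
            for _ in range(n_tasks):
                X = torch.randn(32, emb_dim)          # placeholder task inputs
                y = torch.randn(32)                   # placeholder task targets
                w = meta_w.clone().requires_grad_(True)
                for _ in range(5):                    # fast adaptation: few gradient steps
                    (g,) = torch.autograd.grad(task_loss(w, X, y), w)
                    w = (w - 0.01 * g).detach().requires_grad_(True)
                # Reptile outer update: move meta-parameters toward the adapted ones
                meta_w = meta_w + 0.1 * (w.detach() - meta_w)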
    Enhancing Haptic Distinguishability of Surface Materials with Boosting Technique. (arXiv:2010.02002v4 [cs.LG] UPDATED)
    (0 min) Discriminative features are crucial for several learning applications, such as object detection and classification. Neural networks are extensively used for extracting discriminative features of images and speech signals. However, the lack of large datasets in the haptics domain often limits the applicability of such techniques. This paper presents a general framework for the analysis of the discriminative properties of haptic signals. We demonstrate the effectiveness of spectral features and a boosted embedding technique in enhancing the distinguishability of haptic signals. Experiments indicate our framework needs less training data, generalizes well for different predictors, and outperforms the related state-of-the-art.
    Topographic VAEs learn Equivariant Capsules. (arXiv:2109.01394v2 [cs.LG] UPDATED)
    (0 min) In this work we seek to bridge the concepts of topographic organization and equivariance in neural networks. To accomplish this, we introduce the Topographic VAE: a novel method for efficiently training deep generative models with topographically organized latent variables. We show that such a model indeed learns to organize its activations according to salient characteristics such as digit class, width, and style on MNIST. Furthermore, through topographic organization over time (i.e. temporal coherence), we demonstrate how predefined latent space transformation operators can be encouraged for observed transformed input sequences -- a primitive form of unsupervised learned equivariance. We demonstrate that this model successfully learns sets of approximately equivariant features (i.e. "capsules") directly from sequences and achieves higher likelihood on correspondingly transforming test sequences. Equivariance is verified quantitatively by measuring the approximate commutativity of the inference network and the sequence transformations. Finally, we demonstrate approximate equivariance to complex transformations, expanding upon the capabilities of existing group equivariant neural networks.
    Understanding Layer-wise Contributions in Deep Neural Networks through Spectral Analysis. (arXiv:2111.03972v3 [cs.LG] UPDATED)
    (0 min) Spectral analysis is a powerful tool, decomposing any function into simpler parts. In machine learning, Mercer's theorem generalizes this idea, providing for any kernel and input distribution a natural basis of functions of increasing frequency. More recently, several works have extended this analysis to deep neural networks through the framework of Neural Tangent Kernel. In this work, we analyze the layer-wise spectral bias of Deep Neural Networks and relate it to the contributions of different layers in the reduction of generalization error for a given target function. We utilize the properties of Hermite polynomials and Spherical Harmonics to prove that initial layers exhibit a larger bias towards high-frequency functions defined on the unit sphere. We further provide empirical results validating our theory in high dimensional datasets for Deep Neural Networks.
    Modeling Historical AIS Data For Vessel Path Prediction: A Comprehensive Treatment. (arXiv:2001.01592v3 [cs.AI] UPDATED)
    (0 min) The prosperity of artificial intelligence has aroused intense interest in intelligent/autonomous navigation, in which path prediction is a key functionality for decision support, e.g. route planning, collision warning, and traffic regulation. For maritime intelligence, the Automatic Identification System (AIS) plays an important role because it has recently been made compulsory for large international commercial vessels and is able to provide nearly real-time information about a vessel. AIS-based vessel path prediction is therefore a promising direction for future maritime intelligence. However, real-world AIS data collected online are highly irregular trajectory segments (AIS message sequences) from different types of vessels and geographical regions, with possibly very low data quality. Thus, although some works have studied how to build a path prediction model using historical AIS data, it remains a very challenging problem. In this paper, we propose a comprehensive framework to model massive historical AIS trajectory segments for accurate vessel path prediction. Experimental comparisons with existing popular methods validate the proposed approach and show that it outperforms the baseline methods by a wide margin.
    $m^\ast$ of two-dimensional electron gas: a neural canonical transformation study. (arXiv:2201.03156v1 [cond-mat.stat-mech])
    (0 min) The quasiparticle effective mass $m^\ast$ of interacting electrons is a fundamental quantity in the Fermi liquid theory. However, the precise value of the effective mass of the uniform electron gas is still elusive after decades of research. The newly developed neural canonical transformation approach (arXiv:2105.08644) offers a principled way to extract the effective mass of the electron gas by directly calculating the thermal entropy at low temperature. The approach models a variational many-electron density matrix using two generative neural networks: an autoregressive model for momentum occupation and a normalizing flow for electron coordinates. Our calculation reveals a suppression of the effective mass in the two-dimensional spin-polarized electron gas, which is more pronounced than previous reports in the low-density strong-coupling region. This prediction calls for verification in two-dimensional electron gas experiments.
    GridTuner: Reinvestigate Grid Size Selection for Spatiotemporal Prediction Models [Technical Report]. (arXiv:2201.03244v1 [cs.DB])
    (0 min) With the development of traffic prediction technology, spatiotemporal prediction models have attracted increasing attention from academic communities and industry. However, most existing research focuses on reducing a model's prediction error while ignoring the error caused by the uneven distribution of spatial events within a region. In this paper, we study a region partitioning problem, namely the optimal grid size selection problem (OGSS), which aims to minimize the real error of spatiotemporal prediction models by selecting the optimal grid size. To solve OGSS, we analyze the upper bound of the real error of spatiotemporal prediction models and minimize the real error by minimizing its upper bound. Through in-depth analysis, we find that the upper bound of the real error first decreases and then increases as the number of model grids increases from 1 to the maximum allowed value. We then propose two algorithms, namely Ternary Search and Iterative Method, to automatically find the optimal grid size. Finally, our experiments verify that the prediction error follows the same trend as its upper bound, confirming this decrease-then-increase behavior. Meanwhile, in a case study, selecting the optimal grid size improved the order dispatching results of a state-of-the-art prediction-based algorithm by up to 13.6%, which shows the effectiveness of our methods in tuning the region partition for spatiotemporal prediction models.
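    The Ternary Search idea relies exactly on this unimodality: if the error upper bound first decreases and then increases in the number of grids, the minimizer can be located in O(log n) evaluations. A minimal sketch, assuming a unimodal error function, follows.

        # Ternary search over the number of grids, assuming the error bound is unimodal
        # (first decreasing, then increasing) in that number, as the paper argues.
        def ternary_search(error, lo, hi):
            """Return the grid count in [lo, hi] minimising the unimodal error."""
            while hi - lo > 2:
                m1 = lo + (hi - lo) // 3
                m2 = hi - (hi - lo) // 3
                if error(m1) < error(m2):
                    hi = m2 - 1        # minimum cannot lie at m2 or beyond
                else:
                    lo = m1 + 1        # minimum cannot lie at m1 or before
            return min(range(lo, hi + 1), key=error)

        # Toy usage with an assumed unimodal error curve minimised at 40 grids
        best = ternary_search(lambda n: (n - 40) ** 2 + 5, 1, 200)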
    Scaling Knowledge Graph Embedding Models. (arXiv:2201.02791v1 [cs.LG])
    (0 min) Developing scalable solutions for training Graph Neural Networks (GNNs) for link prediction tasks is challenging due to high data dependencies, which entail high computational cost and a huge memory footprint. We propose a new method for scaling the training of knowledge graph embedding models for link prediction to address these challenges. Towards this end, we propose the following algorithmic strategies: self-sufficient partitions, constraint-based negative sampling, and edge mini-batch training. Both the partitioning strategy and constraint-based negative sampling avoid cross-partition data transfer during training. In our experimental evaluation, we show that our scaling solution for GNN-based knowledge graph embedding models achieves a 16x speed-up on benchmark datasets while maintaining comparable model performance to non-distributed methods on standard metrics.
    DataLens: Scalable Privacy Preserving Training via Gradient Compression and Aggregation. (arXiv:2103.11109v5 [cs.LG] UPDATED)
    (0 min) Recent success of deep neural networks (DNNs) hinges on the availability of large-scale datasets; however, training on such datasets often poses privacy risks for sensitive training information. In this paper, we aim to explore the power of generative models and gradient sparsity, and propose a scalable privacy-preserving generative model DATALENS. Compared with the standard PATE privacy-preserving framework, which allows teachers to vote on one-dimensional predictions, voting on high-dimensional gradient vectors is challenging in terms of privacy preservation. As dimension reduction techniques are required, we need to navigate a delicate tradeoff space between (1) the improvement of privacy preservation and (2) the slowdown of SGD convergence. To tackle this, we take advantage of communication-efficient learning and propose a novel noise compression and aggregation approach TOPAGG by combining top-k compression for dimension reduction with a corresponding noise injection mechanism. We theoretically prove that the DATALENS framework guarantees differential privacy for its generated data, and provide an analysis of its convergence. To demonstrate the practical usage of DATALENS, we conduct extensive experiments on diverse datasets including MNIST, Fashion-MNIST, and high-dimensional CelebA, and show that DATALENS significantly outperforms other baseline DP generative models. In addition, we adapt the proposed TOPAGG approach, one of the key building blocks in DATALENS, to DP SGD training, and show that it achieves higher utility than the state-of-the-art DP SGD approach in most cases. Our code is publicly available at https://github.com/AI-secure/DataLens.
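    A toy version of such a top-k-compression-plus-noise aggregation step might look as follows; the teacher count, k, noise scale, and the plain averaging at the end are illustrative assumptions, and the actual DP calibration follows the paper, not this sketch.

        # Sketch of a top-k + sign-quantisation + noise aggregation step over teacher
        # gradients; shapes and the noise scale are illustrative, not DP-calibrated.
        import numpy as np

        def top_k_sign(grad, k):
            """Keep only the k largest-magnitude coordinates, quantised to their sign."""
            out = np.zeros_like(grad)
            idx = np.argsort(np.abs(grad))[-k:]
            out[idx] = np.sign(grad[idx])
            return out

        rng = np.random.default_rng(0)
        teacher_grads = rng.normal(size=(50, 1000))   # 50 teachers, 1000-dim gradients
        k, sigma = 30, 5.0

        votes = sum(top_k_sign(g, k) for g in teacher_grads)            # sparse votes
        noisy_votes = votes + rng.normal(scale=sigma, size=votes.shape) # noise injection
        update = noisy_votes / len(teacher_grads)     # averaged, privatised estimate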
    Local Information Assisted Attention-free Decoder for Audio Captioning. (arXiv:2201.03217v1 [cs.SD])
    (0 min) Automated audio captioning (AAC) aims to describe audio data with captions using natural language. Most existing AAC methods adopt an encoder-decoder structure, where an attention-based mechanism is a popular choice in the decoder (e.g., the Transformer decoder) for predicting captions from audio features. Such attention-based decoders can capture global information from the audio features; however, their ability to extract local information can be limited, which may lead to degraded quality in the generated captions. In this paper, we present an AAC method with an attention-free decoder, where an encoder based on PANNs is employed for audio feature extraction, and the attention-free decoder is designed to introduce local information. The proposed method enables the effective use of both global and local information from audio signals. Experiments show that our method outperforms the state-of-the-art methods with the standard attention-based decoder in Task 6 of the DCASE 2021 Challenge.
    Opportunities of Hybrid Model-based Reinforcement Learning for Cell Therapy Manufacturing Process Development and Control. (arXiv:2201.03116v1 [eess.SY])
    (0 min) Driven by the key challenges of cell therapy manufacturing, including high complexity, high uncertainty, and very limited process data, we propose a stochastic optimization framework named "hybrid-RL" to efficiently guide process development and control. We first create a bioprocess probabilistic knowledge graph, a hybrid model characterizing the understanding of biomanufacturing process mechanisms and quantifying inherent stochasticity such as batch-to-batch variation and bioprocess noise. It can capture the key features, including nonlinear reactions, time-varying kinetics, and partially observed bioprocess state. This hybrid model can leverage existing mechanistic models and facilitate learning from process data. Given limited process data, a computational sampling approach is used to generate posterior samples quantifying the model estimation uncertainty. Then, we introduce hybrid model-based Bayesian reinforcement learning (RL), accounting for both inherent stochasticity and model uncertainty, to guide optimal, robust, and interpretable decision making, which can overcome the key challenges of cell therapy manufacturing. In an empirical study, cell therapy manufacturing examples are used to demonstrate that the proposed hybrid-RL framework can outperform classical deterministic mechanistic-model-assisted process optimization.
    Benchmarks, Algorithms, and Metrics for Hierarchical Disentanglement. (arXiv:2102.05185v3 [cs.LG] UPDATED)
    (0 min) In representation learning, there has been recent interest in developing algorithms to disentangle the ground-truth generative factors behind a dataset, and metrics to quantify how fully this occurs. However, these algorithms and metrics often assume that both representations and ground-truth factors are flat, continuous, and factorized, whereas many real-world generative processes involve rich hierarchical structure, mixtures of discrete and continuous variables with dependence between them, and even varying intrinsic dimensionality. In this work, we develop benchmarks, algorithms, and metrics for learning such hierarchical representations.
    Collaborative Reflection-Augmented Autoencoder Network for Recommender Systems. (arXiv:2201.03158v1 [cs.IR])
    (0 min) As deep learning techniques have expanded to real-world recommendation tasks, many deep neural network based Collaborative Filtering (CF) models have been developed to project user-item interactions into a latent feature space, based on various neural architectures such as the multi-layer perceptron, auto-encoder, and graph neural networks. However, the majority of existing collaborative filtering systems are not well designed to handle missing data. In particular, in order to inject negative signals into the training phase, these solutions largely rely on negative sampling from unobserved user-item interactions, simply treating them as negative instances, which degrades recommendation performance. To address these issues, we develop a Collaborative Reflection-Augmented Autoencoder Network (CRANet) that is capable of exploring transferable knowledge from observed and unobserved user-item interactions. The network architecture of CRANet is formed of an integrative structure with a reflective receptor network and an information fusion autoencoder module, which endows our recommendation framework with the ability to encode users' implicit pairwise preferences on both interacted and non-interacted items. Additionally, a parametric regularization-based tied-weight scheme is designed to perform robust joint training of the two-stage CRANet model. We finally validate CRANet experimentally on four diverse benchmark datasets corresponding to two recommendation tasks, showing that debiasing the negative signals of user-item interactions improves performance compared to various state-of-the-art recommendation techniques. Our source code is available at https://github.com/akaxlh/CRANet.
    Near Optimality of Finite Memory Feedback Policies in Partially Observed Markov Decision Processes. (arXiv:2010.07452v2 [math.OC] UPDATED)
    (0 min) In the theory of Partially Observed Markov Decision Processes (POMDPs), the existence of optimal policies has in general been established by converting the original partially observed stochastic control problem into a fully observed one on the belief space, leading to a belief-MDP. However, computing an optimal policy for this fully observed model, and hence for the original POMDP, using classical dynamic or linear programming methods is challenging even if the original system has finite state and action spaces, since the state space of the fully observed belief-MDP model is always uncountable. Furthermore, there exist very few rigorous value function approximation and optimal policy approximation results, as the required regularity conditions often entail a tedious study involving the spaces of probability measures and properties such as Feller continuity. In this paper, we study a planning problem for POMDPs where the system dynamics and measurement channel model are assumed to be known. We construct an approximate belief model by discretizing the belief space using only finite window information variables. We then find optimal policies for the approximate model, and we rigorously establish near optimality of the constructed finite window control policies in POMDPs under mild non-linear filter stability conditions and the assumption that the measurement and action sets are finite (and the state space is real vector valued). We also establish a rate of convergence result which relates the finite window memory size and the approximation error bound, where the rate of convergence is exponential under explicit and testable exponential filter stability conditions. While there exist many experimental results and few rigorous asymptotic convergence results, an explicit rate of convergence result is new in the literature, to our knowledge.
    Distinguishing Natural and Computer-Generated Images using Multi-Colorspace fused EfficientNet. (arXiv:2110.09428v2 [cs.CV] UPDATED)
    (0 min) The problem of distinguishing natural images from photo-realistic computer-generated ones has so far addressed either natural images versus computer graphics or natural images versus GAN images, one at a time. But in a real-world image forensic scenario, it is essential to consider all categories of image generation, since in most cases the generation method is unknown. To the best of our knowledge, we are the first to approach the problem of distinguishing natural images from photo-realistic computer-generated images as a three-class classification task over natural, computer graphics, and GAN images. For the task, we propose a Multi-Colorspace fused EfficientNet model that fuses, in parallel, three EfficientNet networks following a transfer learning methodology, where each network operates in a different colorspace, RGB, LCH, or HSV, chosen after analyzing the efficacy of various colorspace transformations for this image forensics problem. Our model outperforms the baselines in terms of accuracy, robustness towards post-processing, and generalizability towards other datasets. We conduct psychophysics experiments to understand how accurately humans can distinguish natural, computer graphics, and GAN images, observing that humans find it difficult to classify these images, particularly the computer-generated ones, indicating the necessity of computational algorithms for the task. We also analyze the behavior of our model through visual explanations to understand the salient regions that contribute to the model's decision making, and compare them with manual explanations provided by human participants in the form of region markings; the similarities between the two indicate that our model takes its decisions meaningfully.
    Video-Specific Autoencoders for Exploring, Editing and Transmitting Videos. (arXiv:2103.17261v2 [cs.CV] UPDATED)
    (0 min) We study video-specific autoencoders that allow a human user to explore, edit, and efficiently transmit videos. Prior work has independently looked at these problems (and sub-problems) and proposed different formulations. In this work, we train a simple autoencoder (from scratch) on multiple frames of a specific video. We observe: (1) latent codes learned by a video-specific autoencoder capture spatial and temporal properties of that video; and (2) autoencoders can project out-of-sample inputs onto the video-specific manifold. These two properties allow us to explore, edit, and efficiently transmit a video using one learned representation. For example, linear operations on latent codes allow users to visualize the contents of a video. Associating the latent codes of a video with manifold projection enables users to make desired edits, and interpolating latent codes with manifold projection allows the transmission of sparse low-resolution frames over a network.
    Trade-offs between membership privacy & adversarially robust learning. (arXiv:2006.04622v2 [cs.LG] UPDATED)
    (0 min) Historically, machine learning methods have not been designed with security in mind. In turn, this has given rise to adversarial examples, carefully perturbed input samples aimed to mislead detection at test time, which have been applied to attack spam and malware classification, and more recently to attack image classification. Consequently, an abundance of research has been devoted to designing machine learning methods that are robust to adversarial examples. Unfortunately, there are desiderata besides robustness that a secure and safe machine learning model must satisfy, such as fairness and privacy. Recent work by Song et al. (2019) has shown, empirically, that there exists a trade-off between robust and private machine learning models. Models designed to be robust to adversarial examples often overfit on training data to a larger extent than standard (non-robust) models. If a dataset contains private information, then any statistical test that separates training and test data by observing a model's outputs can represent a privacy breach, and if a model overfits on training data, these statistical tests become easier. In this work, we identify settings where standard models will overfit to a larger extent in comparison to robust models, and as empirically observed in previous works, settings where the opposite behavior occurs. Thus, it is not necessarily the case that privacy must be sacrificed to achieve robustness. The degree of overfitting naturally depends on the amount of data available for training. We go on to characterize how the training set size factors into the privacy risks exposed by training a robust model on a simple Gaussian data task, and show empirically that our findings hold on image classification benchmark datasets, such as CIFAR-10 and CIFAR-100.
    Avoiding Overfitting: A Survey on Regularization Methods for Convolutional Neural Networks. (arXiv:2201.03299v1 [cs.CV])
    (0 min) Several image processing tasks, such as image classification and object detection, have been significantly improved using Convolutional Neural Networks (CNN). Like ResNet and EfficientNet, many architectures have achieved outstanding results in at least one dataset at the time of their creation. A critical factor in training concerns the network's regularization, which prevents the structure from overfitting. This work analyzes several regularization methods developed in the last few years, showing significant improvements for different CNN models. The works are classified into three main areas: the first is called "data augmentation", where all the techniques focus on performing changes to the input data. The second, named "internal changes", describes procedures that modify the feature maps generated by the neural network or the kernels. The last one, called "label", concerns transforming the labels of a given input. This work presents two main differences compared to other available surveys on regularization: (i) the first concerns the papers gathered in the manuscript, which are no older than five years, and (ii) the second distinction is reproducibility, i.e., all works referred to here have their code available in public repositories or have been directly implemented in some framework, such as TensorFlow or Torch.
    Applying Artificial Intelligence for Age Estimation in Digital Forensic Investigations. (arXiv:2201.03045v1 [cs.CV])
    (0 min) The precise age estimation of child sexual abuse and exploitation (CSAE) victims is one of the most significant digital forensic challenges. Investigators often need to determine the age of victims by looking at images and interpreting sexual development stages and other human characteristics. The main priority, safeguarding children, is often negatively impacted by a huge forensic backlog, cognitive bias, and the immense psychological stress that this work can entail. This paper evaluates existing facial image datasets and proposes a new dataset tailored to the needs of similar digital forensic research contributions. This small, diverse dataset of 0 to 20-year-old individuals contains 245 images and is merged with 82 unique images from the FG-NET dataset, thus achieving a total of 327 images with high image diversity and low age-range density. The new dataset is tested on the Deep EXpectation (DEX) algorithm pre-trained on the IMDB-WIKI dataset. The overall results for young adolescents aged 10 to 15 and older adolescents/adults aged 16 to 20 are very encouraging, achieving MAEs as low as 1.79, but also suggest that the accuracy for children aged 0 to 10 needs further work. To determine the efficacy of the prototype, input from four digital forensic experts, including two forensic investigators, was taken into account to improve the age estimation results. Further research is required to extend the dataset, both in image density and in the equal distribution of factors such as gender and racial diversity.
    Structure-preserving Gaussian Process Dynamics. (arXiv:2102.01606v3 [cs.LG] UPDATED)
    (0 min) Most physical processes possess structural properties such as constant energies, volumes, and other invariants over time. When learning models of such dynamical systems, it is critical to respect these invariants to ensure accurate predictions and physically meaningful behavior. Strikingly, state-of-the-art methods in Gaussian process (GP) dynamics model learning do not address this issue. On the other hand, classical numerical integrators are specifically designed to preserve these crucial properties through time. We propose to combine the advantages of GPs as function approximators with structure-preserving numerical integrators for dynamical systems, such as Runge-Kutta methods. These integrators assume access to the ground-truth dynamics and require evaluations of intermediate and future time steps that are unknown in a learning-based scenario. This makes direct inference of the GP dynamics, with an embedded numerical scheme, intractable. Our key technical contribution is the evaluation of the implicitly defined Runge-Kutta transition probability. In a nutshell, we introduce an implicit layer for GP regression, which is embedded into a variational inference-based model learning scheme.
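    For reference, the classical fourth-order Runge-Kutta step that such an approach embeds (here applied to an explicit toy vector field rather than a learned GP mean) looks like this:

        # Classical fourth-order Runge-Kutta step on an explicit toy vector field.
        import numpy as np

        def rk4_step(f, x, dt):
            k1 = f(x)
            k2 = f(x + 0.5 * dt * k1)
            k3 = f(x + 0.5 * dt * k2)
            k4 = f(x + dt * k3)
            return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

        # Toy usage: a harmonic oscillator, whose energy RK4 tracks closely over
        # moderate horizons thanks to its fourth-order local accuracy
        f = lambda x: np.array([x[1], -x[0]])
        x = np.array([1.0, 0.0])
        for _ in range(1000):
            x = rk4_step(f, x, 0.01)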
    Enhanced Doubly Robust Learning for Debiasing Post-click Conversion Rate Estimation. (arXiv:2105.13623v3 [cs.LG] UPDATED)
    (0 min) Post-click conversion, as a strong signal indicating user preference, is salutary for building recommender systems. However, accurately estimating the post-click conversion rate (CVR) is challenging due to selection bias, i.e., the observed clicked events usually happen on users' preferred items. Currently, most existing methods utilize counterfactual learning to debias recommender systems. Among them, the doubly robust (DR) estimator has achieved competitive performance by combining the error imputation based (EIB) estimator and the inverse propensity score (IPS) estimator in a doubly robust way. However, inaccurate error imputation may result in higher variance than the IPS estimator. Worse still, existing methods typically use simple model-agnostic methods to estimate the imputation error, which are not sufficient to approximate the dynamically changing, model-correlated target (i.e., the gradient direction of the prediction model). To solve these problems, we first derive the bias and variance of the DR estimator. Based on this, we propose a more robust doubly robust (MRDR) estimator that further reduces the variance while retaining double robustness. Moreover, we propose a novel double learning approach for the MRDR estimator, which can convert the error imputation into general CVR estimation. Besides, we empirically verify that the proposed learning scheme can further eliminate the high variance problem of imputation learning. To evaluate its effectiveness, extensive experiments are conducted on a semi-synthetic dataset and two real-world datasets. The results demonstrate the superiority of the proposed approach over the state-of-the-art methods. The code is available at https://github.com/guosyjlu/MRDR-DL.
    NeoNav: Improving the Generalization of Visual Navigation via Generating Next Expected Observations. (arXiv:1906.07207v4 [cs.RO] UPDATED)
    (0 min) We propose improving the cross-target and cross-scene generalization of visual navigation through learning an agent that is guided by conceiving the next observations it expects to see. This is achieved by learning a variational Bayesian model, called NeoNav, which generates the next expected observations (NEO) conditioned on the current observations of the agent and the target view. Our generative model is learned through optimizing a variational objective encompassing two key designs. First, the latent distribution is conditioned on current observations and the target view, leading to model-based, target-driven navigation. Second, the latent space is modeled with a Mixture of Gaussians conditioned on the current observation and the next best action. Our use of a mixture-of-posteriors prior effectively alleviates the issue of an over-regularized latent space, thus significantly boosting the model's generalization to new targets and novel scenes. Moreover, the NEO generation models the forward dynamics of agent-environment interaction, which improves the quality of approximate inference and hence benefits data efficiency. We have conducted extensive evaluations on both real-world and synthetic benchmarks, and show that our model consistently outperforms the state-of-the-art models in terms of success rate, data efficiency, and generalization.
    Morphological Analysis of Japanese Hiragana Sentences using the BI-LSTM CRF Model. (arXiv:2201.03366v1 [cs.CL])
    (0 min) This study proposes a method for developing neural morphological analyzers for Japanese Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. This technique plays an essential role in downstream applications of Japanese natural language processing systems because the Japanese language does not have word delimiters between words. Hiragana is a type of Japanese phonographic character used in texts for children or people who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because there is less information available for segmentation. For morphological analysis of Hiragana sentences, we demonstrated the effectiveness of fine-tuning a model based on ordinary Japanese text and examined the influence of training data from texts of various genres.
    MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound. (arXiv:2201.02639v1 [cs.CV])
    (0 min) As humans, we navigate the world through all our senses, using perceptual input from each one to correct the others. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong representations about videos through all constituent modalities. When finetuned, it sets a new state-of-the-art on both VCR and TVQA, outperforming prior work by 5% and 7% respectively. Ablations show that both tasks benefit from audio pretraining -- even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video understanding tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why incorporating audio leads to better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
    Compressing Models with Few Samples: Mimicking then Replacing. (arXiv:2201.02620v1 [cs.LG])
    (0 min) Few-sample compression aims to compress a big redundant model into a small compact one using only a few samples. If we fine-tune models with these limited samples directly, the models will overfit and learn almost nothing. Hence, previous methods optimize the compressed model layer-by-layer and try to make every layer have the same outputs as the corresponding layer in the teacher model, which is cumbersome. In this paper, we propose a new framework named Mimicking then Replacing (MiR) for few-sample compression, which first urges the pruned model to output the same features as the teacher's in the penultimate layer, and then replaces the teacher's layers before the penultimate one with a well-tuned compact counterpart. Unlike previous layer-wise reconstruction methods, our MiR optimizes the entire network holistically, which is not only simple and effective, but also unsupervised and general. MiR outperforms previous methods by large margins. Code will be available soon.
    Reducing the Long Tail Losses in Scientific Emulations with Active Learning. (arXiv:2111.08498v2 [cs.LG] UPDATED)
    (0 min) Deep-learning-based models are increasingly used to emulate scientific simulations to accelerate scientific research. However, accurate, supervised deep learning models require huge amounts of labelled data, which often becomes the bottleneck in employing neural networks. In this work, we leveraged an active learning approach called core-set selection to actively select data, per a pre-defined budget, to be labelled for training. To further improve the model performance and reduce the training costs, we also warm-started the training using a shrink-and-perturb trick. We tested on two case studies in different fields, namely galaxy halo occupation distribution modelling in astrophysics and x-ray emission spectroscopy in plasma physics, and the results are promising: we achieved competitive overall performance compared to a random sampling baseline and, more importantly, successfully reduced the larger absolute losses, i.e. the long tail of the loss distribution, at virtually no overhead cost.
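    Core-set selection is commonly instantiated as greedy k-center: repeatedly label the point farthest from everything labelled so far. A minimal sketch on placeholder embeddings, assuming this standard greedy variant, is below.

        # Greedy k-center core-set selection; features and budget are assumptions.
        import numpy as np

        def core_set_select(X, budget, seed=0):
            rng = np.random.default_rng(seed)
            chosen = [rng.integers(len(X))]           # start from a random point
            dists = np.linalg.norm(X - X[chosen[0]], axis=1)
            for _ in range(budget - 1):
                nxt = int(dists.argmax())             # farthest point from the chosen set
                chosen.append(nxt)
                dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
            return chosen

        X = np.random.randn(5000, 32)                 # placeholder embeddings
        to_label = core_set_select(X, budget=100)     # indices to send for labelling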
    The State of Aerial Surveillance: A Survey. (arXiv:2201.03080v1 [cs.CV])
    (0 min) The rapid emergence of airborne platforms and imaging sensors is enabling new forms of aerial surveillance due to their unprecedented advantages in scale, mobility, deployment, and covert observation capabilities. This paper provides a comprehensive overview of human-centric aerial surveillance tasks from a computer vision and pattern recognition perspective. It aims to provide readers with an in-depth systematic review and technical analysis of the current state of aerial surveillance tasks using drones, UAVs, and other airborne platforms. The main object of interest is humans, where single or multiple subjects are to be detected, identified, tracked, re-identified, and have their behavior analyzed. For each of these tasks, we first discuss the unique challenges of performing them in an aerial setting compared to a ground-based setting. We then review and analyze the aerial datasets publicly available for each task, and delve deep into the approaches in the aerial literature, investigating how they presently address the aerial challenges. We conclude the paper with a discussion of the missing gaps and open research questions to inform future research avenues.
    A novel interpretable machine learning system to generate clinical risk scores: An application for predicting early mortality or unplanned readmission in a retrospective cohort study. (arXiv:2201.03291v1 [cs.LG])
    (0 min) Risk scores are widely used for clinical decision making and commonly generated from logistic regression models. Machine-learning-based methods may work well for identifying important predictors, but such 'black box' variable selection limits interpretability, and variable importance evaluated from a single model can be biased. We propose a robust and interpretable variable selection approach using the recently developed Shapley variable importance cloud (ShapleyVIC) that accounts for variability across models. Our approach evaluates and visualizes overall variable contributions for in-depth inference and transparent variable selection, and filters out non-significant contributors to simplify model building steps. We derive an ensemble variable ranking from variable contributions, which is easily integrated with an automated and modularized risk score generator, AutoScore, for convenient implementation. In a study of early death or unplanned readmission, ShapleyVIC selected 6 of 41 candidate variables to create a well-performing model, which had similar performance to a 16-variable model from machine-learning-based ranking.
    Teaching CNNs to mimic Human Visual Cognitive Process & regularise Texture-Shape bias. (arXiv:2006.14722v2 [cs.CV] UPDATED)
    (0 min) Recent experiments in computer vision identify texture bias as the primary reason for the strong results of models employing Convolutional Neural Networks (CNNs), conflicting with early works claiming that these networks identify objects using shape. It is believed that the cost function forces the CNN to take a greedy approach and develop a proclivity for local information like texture to increase accuracy, thus failing to explore any global statistics. We propose CognitiveCNN, a new intuitive architecture inspired by feature integration theory in psychology, to utilise human-interpretable features like shape, texture, and edges to reconstruct and classify the image. We define novel metrics to quantify the "relevance" of "abstract information" present in these modalities using attention maps. We further introduce a regularisation method which ensures that each modality, like shape or texture, gets a proportionate influence in a given task, as it does for reconstruction; and perform experiments to show the resulting boost in accuracy and robustness, besides imparting explainability to these CNNs for achieving superior performance in object recognition.
    Auto-Encoder based Co-Training Multi-View Representation Learning. (arXiv:2201.02978v1 [cs.LG])
    (0 min) Multi-view learning utilizes the various representations of an object to mine valuable knowledge and improve the performance of a learning algorithm, and one of its significant directions is sub-space learning. As is well known, the auto-encoder is a deep learning method that can learn the latent features of raw data by reconstructing the input. Building on this, we propose a novel algorithm called Auto-encoder based Co-training Multi-View Learning (ACMVL), which utilizes both complementarity and consistency and finds a joint latent feature representation of multiple views. The algorithm has two stages: the first trains an auto-encoder for each view, and the second trains a supervised network. Interestingly, the two stages partly share weights and assist each other through a co-training process. According to the experimental results, we can learn a well-performing latent feature representation, and the auto-encoder of each view has a more powerful reconstruction ability than a traditional auto-encoder.
    Self-supervised Spatiotemporal Representation Learning by Exploiting Video Continuity. (arXiv:2112.05883v2 [cs.CV] UPDATED)
    (0 min) Recent self-supervised video representation learning methods have found significant success by exploring essential properties of videos, e.g. speed, temporal order, etc. This work exploits an essential yet under-explored property of videos, video continuity, to obtain supervision signals for self-supervised representation learning. Specifically, we formulate three novel continuity-related pretext tasks, i.e. continuity justification, discontinuity localization, and missing section approximation, that jointly supervise a shared backbone for video representation learning. This self-supervision approach, termed Continuity Perception Network (CPNet), solves the three tasks altogether and encourages the backbone network to learn local and long-ranged motion and context representations. It outperforms prior art on multiple downstream tasks, such as action recognition, video retrieval, and action localization. Additionally, video continuity can be complementary to other coarse-grained video properties for representation learning, and integrating the proposed pretext tasks into prior art can yield significant performance gains.
    Fully automatic scoring of handwritten descriptive answers in Japanese language tests. (arXiv:2201.03215v1 [cs.LG])
    (0 min) This paper presents an experiment on automatically scoring handwritten descriptive answers from the trial tests for the new Japanese university entrance examination, which were taken by about 120,000 examinees in 2017 and 2018. There are about 400,000 answers with more than 20 million characters. Although all answers have been scored by human examiners, the handwritten characters are not labelled. We present our attempt to adapt deep neural network-based handwriting recognizers, trained on a labelled handwriting dataset, to this unlabelled answer set. Our proposed method combines different training strategies, ensembles multiple recognizers, and uses a language model built from a large general corpus to avoid overfitting to specific data. In our experiment, the proposed method records character accuracy of over 97% using about 2,000 verified labelled answers that account for less than 0.5% of the dataset. The recognized answers are then fed into a pre-trained automatic scoring system based on the BERT model, without correcting misrecognized characters or providing rubric annotations. The automatic scoring system achieves Quadratic Weighted Kappa (QWK) scores from 0.84 to 0.98. As QWK exceeds 0.8, this represents acceptable agreement between the automatic scoring system and the human examiners. These results are promising for further research on end-to-end automatic scoring of descriptive answers.
    Modeling and Analysis of Intermittent Federated Learning Over Cellular-Connected UAV Networks. (arXiv:2110.07077v2 [cs.LG] UPDATED)
    (0 min) Federated learning (FL) is a promising distributed learning technique particularly suitable for wireless learning scenarios, since it can accomplish a learning task without raw data transportation, thereby preserving data privacy and lowering network resource consumption. However, current works on FL over wireless networks do not thoroughly study its fundamental performance when communication outages occur due to channel impairment and network interference. To accurately characterize the performance of FL over wireless networks, this paper proposes a novel intermittent FL model over a cellular-connected unmanned aerial vehicle (UAV) network, which characterizes communication outages from the UAVs (clients) to their server and data heterogeneity among the datasets at the UAVs. We propose an analytically tractable framework to derive the uplink outage probability and use it to devise a simulation-based approach for evaluating the performance of the proposed intermittent FL model. Our findings reveal how the intermittent FL model is impacted by uplink communication outages and UAV deployment. Extensive numerical simulations show the consistency between the simulated and analytical performances of the proposed intermittent FL model.
    Meta-Generalization for Multiparty Privacy Learning to Identify Anomaly Multimedia Traffic in Graynet. (arXiv:2201.03027v1 [cs.CR])
    (0 min) Identifying anomalous multimedia traffic in cyberspace is a big challenge in distributed service systems, multi-generation networks, and the future internet of everything. This letter explores meta-generalization for a multiparty privacy learning model in graynet to improve the performance of anomalous multimedia traffic identification. The multiparty privacy learning model in graynet is a globally shared model that is partitioned, distributed, and trained by exchanging multiparty parameter updates while preserving private data. Meta-generalization refers to discovering the inherent attributes of a learning model to reduce its generalization error. In experiments, three meta-generalization principles are tested. First, the generalization error of the multiparty privacy learning model in graynet is reduced by changing the dimension of the byte-level embedding. Second, the error is reduced by adapting the depth for extracting packet-level features. Finally, the error is reduced by adjusting the size of the support set for preprocessing traffic-level data. Experimental results demonstrate that the proposal outperforms state-of-the-art learning models for identifying anomalous multimedia traffic.
    An Interpretable Federated Learning-based Network Intrusion Detection Framework. (arXiv:2201.03134v1 [cs.CR])
    (0 min) Learning-based Network Intrusion Detection Systems (NIDSs) are widely deployed for defending against various cyberattacks. Existing learning-based NIDSs mainly use Neural Networks (NNs) as classifiers, relying on the quality and quantity of cyberattack data. Such NN-based approaches are also hard to interpret, which limits their efficiency and scalability. In this paper, we design a new local-global computation paradigm, FEDFOREST, a novel learning-based NIDS combining the interpretable Gradient Boosting Decision Tree (GBDT) and the Federated Learning (FL) framework. Specifically, FEDFOREST is composed of multiple clients that extract local cyberattack data features for the server to train models and detect intrusions. A privacy-enhancing technology is also proposed in FEDFOREST to further defend the privacy of the FL systems. Extensive experiments on 4 cyberattack datasets of different tasks demonstrate that FEDFOREST is effective, efficient, interpretable, and extendable. FEDFOREST ranked first in the 2021 collaborative learning and cybersecurity competition for Chinese college students.
    Uncovering the Source of Machine Bias. (arXiv:2201.03092v1 [cs.LG])
    (0 min) We develop a structural econometric model to capture the decision dynamics of human evaluators on an online micro-lending platform, and estimate the model parameters using a real-world dataset. We find that two types of gender bias, preference-based bias and belief-based bias, are present in human evaluators' decisions, and both favor female applicants. Through counterfactual simulations, we quantify the effect of gender bias on loan granting outcomes and on the welfare of the company and the borrowers. Our results imply that the existence of either the preference-based bias or the belief-based bias reduces the company's profits; when either bias is removed, the company's profits increase. Both increases result from raising the approval probability for borrowers, especially male borrowers, who eventually pay back their loans. For borrowers, the elimination of either bias decreases the gender gap in the true positive rates of the credit risk evaluation. We also train machine learning algorithms on both the real-world data and the data from the counterfactual simulations, and compare the decisions made by these algorithms to see how evaluators' biases are inherited by the algorithms and reflected in machine-based decisions. We find that machine learning algorithms can mitigate both the preference-based bias and the belief-based bias.
    Communication-Efficient Federated Learning with Acceleration of Global Momentum. (arXiv:2201.03172v1 [cs.LG])
    (0 min) Federated learning often suffers from unstable and slow convergence due to the heterogeneous characteristics of participating clients. This tendency is aggravated when the client participation ratio is low, since the information collected from the clients at each round is prone to be more inconsistent. To tackle this challenge, we propose a novel federated learning framework that improves the stability of the server-side aggregation step by sending the clients an accelerated model, estimated with the global gradient, to guide the local gradient updates. Our algorithm naturally aggregates and conveys the global update information to participants with no additional communication cost and does not require storing past models in the clients. We also regularize the local updates to further reduce the bias and improve the stability of local updates. We perform comprehensive empirical studies on real data under various settings and demonstrate the remarkable performance of the proposed method in terms of accuracy and communication efficiency compared to state-of-the-art methods, especially with low client participation rates. Our code is available at https://github.com/ninigapa0/FedAGM
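    A stripped-down sketch of server-side global momentum of this general kind is shown below; client training is stubbed out, and the variable names and coefficients are illustrative assumptions rather than the paper's exact update rule.

        # Sketch of federated averaging with a server-side global momentum buffer;
        # the server broadcasts an accelerated (lookahead) model each round.
        import numpy as np

        def client_update(global_w, rng):
            return global_w - 0.1 * rng.normal(size=global_w.shape)  # stand-in local training

        rng = np.random.default_rng(0)
        w = np.zeros(10)                              # global model
        m = np.zeros(10)                              # global momentum buffer
        beta, n_clients = 0.9, 5

        for rnd in range(100):
            lookahead = w + beta * m                  # accelerated model sent to clients
            deltas = [client_update(lookahead, rng) - lookahead for _ in range(n_clients)]
            m = beta * m + np.mean(deltas, axis=0)    # update global momentum
            w = w + m                                 # server aggregation step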
    Differentially Private $\ell_1$-norm Linear Regression with Heavy-tailed Data. (arXiv:2201.03204v1 [cs.LG])
    (0 min) We study the problem of Differentially Private Stochastic Convex Optimization (DP-SCO) with heavy-tailed data. Specifically, we focus on $\ell_1$-norm linear regression in the $\epsilon$-DP model. While most previous work focuses on the case where the loss function is Lipschitz, here we only assume that the variates have bounded moments. Firstly, we study the case where the $\ell_2$ norm of the data has a bounded second-order moment. We propose an algorithm based on the exponential mechanism and show that it is possible to achieve an upper bound of $\tilde{O}(\sqrt{\frac{d}{n\epsilon}})$ (with high probability). Next, we relax the assumption to a bounded $\theta$-th order moment for some $\theta\in (1, 2)$ and show that it is possible to achieve an upper bound of $\tilde{O}(({\frac{d}{n\epsilon}})^\frac{\theta-1}{\theta})$. Our algorithms can also be extended to more relaxed cases where only each coordinate of the data has bounded moments, and we obtain upper bounds of $\tilde{O}({\frac{d}{\sqrt{n\epsilon}}})$ and $\tilde{O}({\frac{d}{({n\epsilon})^\frac{\theta-1}{\theta}}})$ in the second and $\theta$-th moment cases, respectively.
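    As a rough illustration of the exponential-mechanism idea only (not the paper's algorithm or its guarantees), the sketch below privately selects a slope for a one-dimensional $\ell_1$ regression from a finite grid, clipping the loss so the score has bounded sensitivity. The data, grid, and clipping bound are all assumptions made for the demo.

        import numpy as np

        def exponential_mechanism(candidates, scores, epsilon, sensitivity, rng):
            """Sample a candidate with probability proportional to
            exp(epsilon * score / (2 * sensitivity))."""
            logits = epsilon * np.asarray(scores) / (2.0 * sensitivity)
            logits -= logits.max()                      # numerical stability
            probs = np.exp(logits)
            probs /= probs.sum()
            return candidates[rng.choice(len(candidates), p=probs)]

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 1))
        y = 2.0 * X[:, 0] + rng.standard_t(df=3, size=200)   # heavy-tailed noise

        # candidate slopes on a grid; score = negative clipped l1 loss,
        # clipping at 10 bounds the per-record sensitivity by 10 / n
        grid = np.linspace(-5.0, 5.0, 201)
        scores = [-np.mean(np.clip(np.abs(y - b * X[:, 0]), 0.0, 10.0)) for b in grid]
        beta_hat = exponential_mechanism(grid, scores, epsilon=1.0,
                                         sensitivity=10.0 / len(y), rng=rng)
        print("private slope estimate:", beta_hat)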
    TPAD: Identifying Effective Trajectory Predictions Under the Guidance of Trajectory Anomaly Detection Model. (arXiv:2201.02941v1 [cs.LG])
    (0 min) Trajectory Prediction (TP) is an important research topic in the computer vision and robotics fields. Recently, many stochastic TP models have been proposed to deal with this problem and have achieved better performance than the traditional models with deterministic trajectory outputs. However, these stochastic models can generate a number of future trajectories with different qualities. They lack the ability to self-evaluate, that is, to examine the rationality of their prediction results, and thus cannot guide users in identifying high-quality candidates. This prevents them from performing at their best in real applications. In this paper, we address this defect and propose TPAD, a novel TP evaluation method based on the trajectory Anomaly Detection (AD) technique. In TPAD, we first combine the Automated Machine Learning (AutoML) technique and experience in the AD and TP fields to automatically design an effective trajectory AD model. Then, we utilize the learned trajectory AD model to examine the rationality of the predicted trajectories and screen out good TP results for users. Extensive experimental results demonstrate that TPAD can effectively identify near-optimal prediction results, improving stochastic TP models' practical application effect.
    Comparing Sequential Forecasters. (arXiv:2110.00115v3 [stat.ME] UPDATED)
    (0 min) Consider two or more forecasters, each making a sequence of predictions for different events over time. We ask a relatively basic question: how might we compare these forecasters, either online or post-hoc, while avoiding unverifiable assumptions on how the forecasts or outcomes were generated? This work presents a novel and rigorous answer to this question. We design a sequential inference procedure for estimating the time-varying difference in forecast quality as measured by any scoring rule. The resulting confidence intervals are nonasymptotically valid and can be continuously monitored to yield statistically valid comparisons at arbitrary data-dependent stopping times ("anytime-valid"); this is enabled by adapting variance-adaptive supermartingales, confidence sequences, and e-processes to our setting. Motivated by Shafer and Vovk's game-theoretic probability, our coverage guarantees are also distribution-free, in the sense that they make no distributional assumptions on the forecasts or outcomes. In contrast to a recent work by Henzi and Ziegel, our tools can sequentially test a weak null hypothesis about whether one forecaster outperforms another on average over time. We demonstrate their effectiveness by comparing probability forecasts on Major League Baseball (MLB) games and statistical postprocessing methods for ensemble weather forecasts.
    Masked Gradient-Based Causal Structure Learning. (arXiv:1910.08527v3 [cs.LG] UPDATED)
    (0 min) This paper studies the problem of learning causal structures from observational data. We reformulate the Structural Equation Model (SEM) with additive noises in a form parameterized by a binary graph adjacency matrix and show that, if the original SEM is identifiable, then the binary adjacency matrix can be identified up to super-graphs of the true causal graph under mild conditions. We then utilize the reformulated SEM to develop a causal structure learning method that can be efficiently trained using gradient-based optimization, by leveraging a smooth characterization of acyclicity and the Gumbel-Softmax approach to approximate the binary adjacency matrix. It is found that the obtained entries are typically near zero or one and can be easily thresholded to identify the edges. We conduct experiments on synthetic and real datasets to validate the effectiveness of the proposed method, and show that it readily accommodates different smooth model functions and achieves a much improved performance on most datasets considered.
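    A minimal sketch of a binary Gumbel-Softmax (binary concrete) relaxation for a trainable adjacency matrix, assuming the common Logistic-noise formulation; the paper's exact parameterization and its smooth acyclicity constraint are not reproduced here.

        import numpy as np
        from scipy.special import expit

        def gumbel_sigmoid(logits, tau, rng):
            """Binary-concrete sample: a differentiable surrogate for Bernoulli edges."""
            u = rng.uniform(1e-9, 1 - 1e-9, size=logits.shape)
            noise = np.log(u) - np.log1p(-u)            # Logistic(0, 1) noise
            return expit((logits + noise) / tau)

        rng = np.random.default_rng(0)
        d = 4
        edge_logits = rng.normal(size=(d, d))           # trainable parameters in practice
        np.fill_diagonal(edge_logits, -20.0)            # effectively forbid self-loops

        A_soft = gumbel_sigmoid(edge_logits, tau=0.5, rng=rng)  # used during training
        A_hard = (A_soft > 0.5).astype(float)                   # thresholded afterwards
        print(A_soft.round(2))
        print(A_hard)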
    Stability Based Generalization Bounds for Exponential Family Langevin Dynamics. (arXiv:2201.03064v1 [cs.LG])
    (2 min) We study generalization bounds for noisy stochastic mini-batch iterative algorithms based on the notion of stability. Recent years have seen key advances in data-dependent generalization bounds for noisy iterative learning algorithms such as stochastic gradient Langevin dynamics (SGLD) based on stability (Mou et al., 2018; Li et al., 2020) and information theoretic approaches (Xu and Raginsky, 2017; Negrea et al., 2019; Steinke and Zakynthinou, 2020; Haghifam et al., 2020). In this paper, we unify and substantially generalize stability based generalization bounds and make three technical advances. First, we bound the generalization error of general noisy stochastic iterative algorithms (not necessarily gradient descent) in terms of expected (not uniform) stability. The expected stability can in turn be bounded by a Le Cam-style divergence. Such bounds have an $O(1/n)$ sample dependence unlike many existing bounds with $O(1/\sqrt{n})$ dependence. Second, we introduce Exponential Family Langevin Dynamics (EFLD), a substantial generalization of SGLD which allows exponential family noise to be used with stochastic gradient descent (SGD). We establish data-dependent expected stability based generalization bounds for general EFLD algorithms. Third, we consider an important special case of EFLD: noisy sign-SGD, which extends sign-SGD using Bernoulli noise over $\{-1,+1\}$. Generalization bounds for noisy sign-SGD are implied by those for EFLD, and we also establish optimization guarantees for the algorithm. Further, we present empirical results on benchmark datasets to illustrate that our bounds are non-vacuous and quantitatively much sharper than existing bounds.
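    To make the noisy sign-SGD special case concrete, here is a minimal NumPy sketch on a toy quadratic: each coordinate of the gradient sign is flipped independently with some probability, i.e., multiplicative Bernoulli noise over {-1, +1}. The objective, flip probability, and step size are illustrative assumptions, not the paper's experimental setup.

        import numpy as np

        def noisy_sign(g, flip_prob, rng):
            """sign(g) with each coordinate flipped independently w.p. flip_prob."""
            s = np.sign(g)
            flips = rng.random(g.shape) < flip_prob
            return np.where(flips, -s, s)

        rng = np.random.default_rng(0)
        target = rng.normal(size=10)
        w, lr = np.zeros(10), 0.01
        for _ in range(5000):
            g = w - target                  # gradient of 0.5 * ||w - target||^2
            w -= lr * noisy_sign(g, flip_prob=0.1, rng=rng)
        print("distance to optimum:", np.linalg.norm(w - target))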
    Fast solver for J2-perturbed Lambert problem using deep neural network. (arXiv:2201.02942v1 [math.NA])
    (2 min) This paper presents a novel and fast solver for the J2-perturbed Lambert problem. The solver consists of an intelligent initial guess generator combined with a differential correction procedure. The intelligent initial guess generator is a deep neural network that is trained to correct the initial velocity vector coming from the solution of the unperturbed Lambert problem. The differential correction module takes the initial guess and uses a forward shooting procedure to further update the initial velocity and exactly meet the terminal conditions. Eight sample forms are analyzed and compared to find the optimum form to train the neural network on the J2-perturbed Lambert problem. The accuracy and performance of this novel approach will be demonstrated on a representative test case: the solution of a multi-revolution J2-perturbed Lambert problem in the Jupiter system. We will compare the performance of the proposed approach against a classical standard shooting method and a homotopy-based perturbed Lambert algorithm. It will be shown that, for a comparable level of accuracy, the proposed method is significantly faster than the other two.
    Discriminant Analysis in Contrasting Dimensions for Polycystic Ovary Syndrome Prognostication. (arXiv:2201.03029v1 [cs.LG])
    (2 min) Many prognostication methodologies have been formulated for the early detection of Polycystic Ovary Syndrome (PCOS) using Machine Learning. PCOS detection is a binary classification problem. Dimensionality reduction methods significantly impact the performance of Machine Learning models, and using a supervised dimensionality reduction method can give us a new edge in tackling this problem. In this paper we present Discriminant Analysis in different dimensions, in both Linear and Quadratic form, for binary classification, along with evaluation metrics. We were able to achieve good accuracy and less variation with Discriminant Analysis compared to many commonly used classification algorithms, with training accuracy reaching 97.37% and testing accuracy of 95.92% using Quadratic Discriminant Analysis. The paper also presents an analysis of the data with visualizations for a deeper understanding of the problem.
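    For readers who want the general recipe, here is a hedged sketch with scikit-learn's discriminant analysis classifiers on a synthetic stand-in dataset (the PCOS data itself is not reproduced here, so every dataset parameter below is an assumption):

        from sklearn.datasets import make_classification
        from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                                   QuadraticDiscriminantAnalysis)
        from sklearn.model_selection import train_test_split

        # synthetic stand-in for a PCOS-style tabular binary classification task
        X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                                   random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

        for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
            model.fit(X_tr, y_tr)
            print(type(model).__name__,
                  "train acc:", round(model.score(X_tr, y_tr), 4),
                  "test acc:", round(model.score(X_te, y_te), 4))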
    Hierarchical Graph-Convolutional Variational AutoEncoding for Generative Modelling of Human Motion. (arXiv:2111.12602v3 [cs.CV] UPDATED)
    (2 min) Models of human motion commonly focus either on trajectory prediction or action classification but rarely both. The marked heterogeneity and intricate compositionality of human motion render each task vulnerable to the data degradation and distributional shift common to real-world scenarios. A sufficiently expressive generative model of action could in theory enable data conditioning and distributional resilience within a unified framework applicable to both tasks. Here we propose a novel architecture based on hierarchical variational autoencoders and deep graph convolutional neural networks for generating a holistic model of action over multiple time-scales. We show this Hierarchical Graph-convolutional Variational Autoencoder (HG-VAE) to be capable of generating coherent actions, detecting out-of-distribution data, and imputing missing data by gradient ascent on the model's posterior. Trained and evaluated on H3.6M and the largest collection of open source human motion data, AMASS, we show HG-VAE can facilitate downstream discriminative learning better than baseline models.
    Systematic biases when using deep neural networks for annotating large catalogs of astronomical images. (arXiv:2201.03131v1 [astro-ph.GA])
    (2 min) Deep convolutional neural networks (DCNNs) have become the most common solution for automatic image annotation due to their non-parametric nature, good performance, and their accessibility through libraries such as TensorFlow. Among other fields, DCNNs are also a common approach to the annotation of large astronomical image databases acquired by digital sky surveys. One of the main downsides of DCNNs is the complex non-intuitive rules that make DCNNs act as a "black box", providing annotations in a manner that is unclear to the user. Therefore, the user is often not able to know what information is used by the DCNNs for the classification. Here we demonstrate that the training of a DCNN is sensitive to the context of the training data such as the location of the objects in the sky. We show that for basic classification of elliptical and spiral galaxies, the sky location of the galaxies used for training affects the behavior of the algorithm, and leads to a small but consistent and statistically significant bias. That bias exhibits itself in the form of cosmological-scale anisotropy in the distribution of basic galaxy morphology. Therefore, while DCNNs are powerful tools for annotating images of extended sources, the construction of training sets for galaxy morphology should take into consideration more aspects than the visual appearance of the object. In any case, catalogs created with deep neural networks that exhibit signs of cosmological anisotropy should be interpreted with the possibility of consistent bias.
    COVID-19 Infection Segmentation from Chest CT Images Based on Scale Uncertainty. (arXiv:2201.03053v1 [eess.IV])
    (3 min) This paper proposes a segmentation method for infection regions in the lung from CT volumes of COVID-19 patients. COVID-19 has spread worldwide, infecting many patients and causing many deaths. CT image-based diagnosis of COVID-19 can provide quick and accurate diagnosis results. An automated segmentation method for infection regions in the lung provides a quantitative criterion for diagnosis. Previous methods process whole 2D images or 3D volumes. Infection regions vary considerably in size, and such processes easily miss small infection regions. A patch-based process is effective for segmenting small targets. However, selecting the appropriate patch size is difficult in infection region segmentation. We utilize the scale uncertainty among various receptive field sizes of a segmentation FCN to obtain infection regions. The receptive field sizes can be defined as the patch size and the resolution of the volumes from which patches are clipped. This paper proposes an infection segmentation network (ISNet) that performs patch-based segmentation and a scale uncertainty-aware prediction aggregation method that refines the segmentation result. We design ISNet to segment infection regions that have various intensity values. ISNet has multiple encoding paths to process patch volumes normalized by multiple intensity ranges. We collect prediction results generated by ISNets having various receptive field sizes. Scale uncertainty among the prediction results is extracted by the prediction aggregation method. We use an aggregation FCN to generate a refined segmentation result considering scale uncertainty among the predictions. In our experiments using 199 chest CT volumes of COVID-19 cases, the prediction aggregation method improved the dice similarity score from 47.6% to 62.1%.
    Differentially Private Generative Adversarial Networks with Model Inversion. (arXiv:2201.03139v1 [cs.LG])
    (2 min) To protect sensitive data when training a Generative Adversarial Network (GAN), the standard approach is to use the differentially private (DP) stochastic gradient descent method, in which controlled noise is added to the gradients. The quality of the output synthetic samples can be adversely affected, and the training of the network may not even converge in the presence of this noise. We propose the Differentially Private Model Inversion (DPMI) method, where the private data is first mapped to the latent space via a public generator, followed by a lower-dimensional DP-GAN with better convergence properties. Experimental results on the standard datasets CIFAR10 and SVHN, as well as on a facial landmark dataset for Autism screening, show that our approach outperforms the standard DP-GAN method based on Inception Score, Fréchet Inception Distance, and classification accuracy under the same privacy guarantee.
    Detecting CAN Masquerade Attacks with Signal Clustering Similarity. (arXiv:2201.02665v1 [cs.CR])
    (2 min) Vehicular Controller Area Networks (CANs) are susceptible to cyber attacks of different levels of sophistication. Fabrication attacks are the easiest to administer -- an adversary simply sends (extra) frames on a CAN -- but also the easiest to detect because they disrupt frame frequency. To overcome time-based detection methods, adversaries must administer masquerade attacks by sending frames in lieu of (and therefore at the expected time of) benign frames but with malicious payloads. Research efforts have proven that CAN attacks, and masquerade attacks in particular, can affect vehicle functionality. Examples include causing unintended acceleration, deactivating the vehicle's brakes, and even steering the vehicle. We hypothesize that masquerade attacks modify the nuanced correlations of CAN signal time series and how they cluster together; therefore, changes in cluster assignments should indicate anomalous behavior. We confirm this hypothesis by leveraging our previously developed capability for reverse engineering CAN signals (i.e., CAN-D [Controller Area Network Decoder]) and focus on advancing the state of the art for detecting masquerade attacks by analyzing time series extracted from raw CAN frames. Specifically, we demonstrate that masquerade attacks can be detected by computing time series clustering similarity using hierarchical clustering on the vehicle's CAN signals (time series) and comparing the clustering similarity across CAN captures with and without attacks. We test our approach on a previously collected CAN dataset with masquerade attacks (i.e., the ROAD dataset) and develop a forensic tool as a proof of concept to demonstrate the potential of the proposed approach for detecting CAN masquerade attacks.
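    The clustering-similarity idea can be sketched in a few lines: hierarchically cluster the signals by correlation distance in a benign capture and in a suspect capture, then compare the two partitions. The sketch below uses the adjusted Rand index as the similarity score and synthetic signals; it is not the authors' pipeline (CAN-D signal extraction and their chosen similarity measure are omitted).

        import numpy as np
        from scipy.cluster.hierarchy import linkage, fcluster
        from sklearn.metrics import adjusted_rand_score

        def cluster_signals(signals, n_clusters=2):
            """Hierarchically cluster signals by correlation distance."""
            dist = 1.0 - np.abs(np.corrcoef(signals))
            condensed = dist[np.triu_indices_from(dist, k=1)]
            Z = linkage(condensed, method="average")
            return fcluster(Z, t=n_clusters, criterion="maxclust")

        rng = np.random.default_rng(0)
        t = np.linspace(0, 10, 500)
        benign = np.stack([np.sin(t),
                           np.sin(t) + 0.1 * rng.normal(size=t.size),
                           np.cos(t),
                           np.cos(t) + 0.1 * rng.normal(size=t.size)])
        attacked = benign.copy()
        attacked[1] = rng.normal(size=t.size)   # masqueraded signal breaks correlations

        # low similarity between the two clusterings flags the capture as anomalous
        print("clustering similarity (ARI):",
              adjusted_rand_score(cluster_signals(benign), cluster_signals(attacked)))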
    Embracing New Techniques in Deep Learning for Estimating Image Memorability. (arXiv:2105.10598v3 [cs.CV] UPDATED)
    (2 min) Various work has suggested that the memorability of an image is consistent across people, and thus can be treated as an intrinsic property of an image. Using computer vision models, we can make specific predictions about what people will remember or forget. While older work has used now-outdated deep learning architectures to predict image memorability, innovations in the field have given us new techniques to apply to this problem. Here, we propose and evaluate five alternative deep learning models which exploit developments in the field from the last five years, largely the introduction of residual neural networks, which are intended to allow the model to use semantic information in the memorability estimation process. These new models were tested against the prior state of the art with a combined dataset built to optimize both within-category and across-category predictions. Our findings suggest that the key prior memorability network had overstated its generalizability and was overfit on its training set. Our new models outperform this prior model, leading us to conclude that Residual Networks outperform simpler convolutional neural networks in memorability regression. We make our new state-of-the-art model readily available to the research community, allowing memory researchers to make predictions about memorability on a wider range of images.
    Robust classification with flexible discriminant analysis in heterogeneous data. (arXiv:2201.02967v1 [stat.ML])
    (2 min) Linear and Quadratic Discriminant Analysis are well-known classical methods but can heavily suffer from non-Gaussian distributions and/or contaminated datasets, mainly because the underlying Gaussian assumption is not robust. To fill this gap, this paper presents a new robust discriminant analysis where each data point is drawn from its own arbitrary Elliptically Symmetrical (ES) distribution with its own arbitrary scale parameter. Such a model allows for possibly very heterogeneous, independent but non-identically distributed samples. After deriving a new decision rule, it is shown that maximum-likelihood parameter estimation and classification are very simple, fast and robust compared to state-of-the-art methods.
    Least-Squares ReLU Neural Network (LSNN) Method For Scalar Nonlinear Hyperbolic Conservation Law. (arXiv:2105.11627v3 [math.NA] UPDATED)
    (2 min) We introduced the least-squares ReLU neural network (LSNN) method for solving the linear advection-reaction problem with discontinuous solution and showed that the method outperforms mesh-based numerical methods in terms of the number of degrees of freedom. This paper studies the LSNN method for scalar nonlinear hyperbolic conservation laws. The method is a discretization of an equivalent least-squares (LS) formulation in the set of neural network functions with the ReLU activation function. Evaluation of the LS functional is done using numerical integration and a conservative finite volume scheme. Numerical results on several test problems show that the method is capable of approximating the discontinuous interface of the underlying problem automatically through the free breaking lines of the ReLU neural network. Moreover, the method does not exhibit the common Gibbs phenomenon along the discontinuous interface.
    Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning. (arXiv:2201.03110v1 [cs.CL])
    (2 min) Achieving universal translation between all human language pairs is the holy grail of machine translation (MT) research. While recent progress in massively multilingual MT is one step closer to reaching this goal, it is becoming evident that extending a multilingual MT system simply by training on more parallel data is unscalable, since the availability of labeled data for low-resource and non-English-centric language pairs is prohibitively limited. To this end, we present a pragmatic approach towards building a multilingual MT model that covers hundreds of languages, using a mixture of supervised and self-supervised objectives, depending on the data availability for different language pairs. We demonstrate that the synergy between these two training paradigms enables the model to produce high-quality translations in the zero-resource setting, even surpassing supervised translation quality for low- and mid-resource languages. We conduct a wide array of experiments to understand the effect of the degree of multilingual supervision, domain mismatches, and amounts of parallel and monolingual data on the quality of our self-supervised multilingual models. To demonstrate the scalability of the approach, we train models with over 200 languages and demonstrate high performance on zero-resource translation on several previously under-studied languages. We hope our findings will serve as a stepping stone towards enabling translation for the next thousand languages.
    Medication Error Detection Using Contextual Language Models. (arXiv:2201.03035v1 [cs.CL])
    (2 min) Medication errors most commonly occur at the ordering or prescribing stage, potentially leading to medical complications and poor health outcomes. While it is possible to catch these errors using different techniques, the focus of this work is on the textual and contextual analysis of prescription information to detect and prevent potential medication errors. In this paper, we demonstrate how to use BERT-based contextual language models to detect anomalies in written or spoken text based on a data set extracted from real-world medical data of thousands of patient records. The proposed models are able to learn patterns of text dependency and predict erroneous output based on contextual information such as patient data. The experimental results yield accuracy up to 96.63% for text input and up to 79.55% for speech input, which is satisfactory for most real-world applications.
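    One standard way to score text plausibility with a BERT-style model is the pseudo-log-likelihood: mask each token in turn and sum the log-probabilities the masked language model assigns to the true tokens. This is a plausible stand-in for the paper's approach rather than its actual models (which are tuned on medical data and patient context); the checkpoint name and the example prescriptions are assumptions.

        import torch
        from transformers import AutoModelForMaskedLM, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

        def pseudo_log_likelihood(text):
            """Mask each token in turn; sum the log-probability of the true token."""
            ids = tok(text, return_tensors="pt")["input_ids"][0]
            total = 0.0
            for i in range(1, len(ids) - 1):        # skip [CLS] and [SEP]
                masked = ids.clone()
                masked[i] = tok.mask_token_id
                with torch.no_grad():
                    logits = model(input_ids=masked.unsqueeze(0)).logits[0, i]
                total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
            return total

        # a much lower score suggests a less plausible (possibly erroneous) order
        print(pseudo_log_likelihood("take 500 mg of amoxicillin twice daily"))
        print(pseudo_log_likelihood("take 500 kg of amoxicillin twice daily"))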
    Learning with less labels in Digital Pathology via Scribble Supervision from natural images. (arXiv:2201.02627v1 [eess.IV])
    (2 min) A critical challenge of training deep learning models in the Digital Pathology (DP) domain is the high annotation cost by medical experts. One way to tackle this issue is via transfer learning from the natural image domain (NI), where the annotation cost is considerably cheaper. Cross-domain transfer learning from NI to DP has been shown to be successful via class labels (Teh et al., 2020). One potential weakness of relying on class labels is the lack of spatial information, which can be obtained from spatial labels such as full pixel-wise segmentation labels and scribble labels. We demonstrate that scribble labels from the NI domain can boost the performance of DP models on two cancer classification datasets (Patch Camelyon Breast Cancer and Colorectal Cancer dataset). Furthermore, we show that models trained with scribble labels yield the same performance boost as full pixel-wise segmentation labels despite being significantly easier and faster to collect.
    Traversing the Local Polytopes of ReLU Neural Networks: A Unified Approach for Network Verification. (arXiv:2111.08922v2 [cs.LG] UPDATED)
    (2 min) Although neural networks (NNs) with ReLU activation functions have found success in a wide range of applications, their adoption in risk-sensitive settings has been limited by concerns about robustness and interpretability. Previous works on examining robustness and improving interpretability partially exploited the piecewise linear form of ReLU NNs. In this paper, we explore the unique topological structure that ReLU NNs create in the input space, identifying the adjacency among the partitioned local polytopes and developing a traversing algorithm based on this adjacency. Our polytope traversing algorithm can be adapted to verify a wide range of network properties related to robustness and interpretability, providing a unified approach to examining network behavior. As the traversing algorithm explicitly visits all local polytopes, it returns a clear and full picture of the network behavior within the traversed region. The time and space complexity of the traversing algorithm is determined by the number of a ReLU NN's partitioning hyperplanes passing through the traversing region.
    On the Double Descent of Random Features Models Trained with SGD. (arXiv:2110.06910v4 [stat.ML] UPDATED)
    (2 min) This paper studies the generalization properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD). In this regime, we derive precise non-asymptotic error bounds for RF regression under both constant and adaptive step-size SGD settings, and observe the double descent phenomenon both theoretically and empirically. Our analysis shows how to cope with multiple randomness sources of initialization, label noise, and data sampling (as well as stochastic gradients) with no closed-form solution, and also goes beyond the commonly-used Gaussian/spherical data assumption. Our theoretical results demonstrate that, with SGD training, RF regression still generalizes well for interpolation learning, and is able to characterize the double descent behavior by the unimodality of variance and monotonic decrease of bias. Besides, we also prove that the constant step-size SGD setting incurs no loss in convergence rate when compared to the exact minimal-norm interpolator, as a theoretical justification for using SGD in practice.
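    A self-contained sketch of random features regression trained with constant step-size SGD in the over-parameterized regime (more features than samples); the data-generating process and all hyperparameters are illustrative, not those analyzed in the paper.

        import numpy as np

        rng = np.random.default_rng(0)
        n, d, D = 500, 10, 2000                    # D > n: over-parameterized

        X = rng.normal(size=(n, d)) / np.sqrt(d)
        y = np.sin(X @ rng.normal(size=d)) + 0.1 * rng.normal(size=n)

        # fixed random features: phi(x) = sqrt(2/D) * cos(W x + b)
        W = rng.normal(size=(d, D))
        b = rng.uniform(0, 2 * np.pi, size=D)
        Phi = np.sqrt(2.0 / D) * np.cos(X @ W + b)

        theta = np.zeros(D)
        lr = 0.5                                   # constant step size
        for _ in range(20000):
            i = rng.integers(n)                    # sample one data point
            err = Phi[i] @ theta - y[i]
            theta -= lr * err * Phi[i]

        print("train MSE:", np.mean((Phi @ theta - y) ** 2))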
    Rethink Stealthy Backdoor Attacks in Natural Language Processing. (arXiv:2201.02993v1 [cs.CL])
    (2 min) Recently, it has been shown that natural language processing (NLP) models are vulnerable to a kind of security threat called the Backdoor Attack, which utilizes a `backdoor trigger' paradigm to mislead the models. The most threatening backdoor attack is the stealthy backdoor, which defines the triggers as text style or syntax. Although they have achieved an incredibly high attack success rate (ASR), we find that the principal factor contributing to their ASR is not the `backdoor trigger' paradigm. Thus the capacity of these stealthy backdoor attacks is overestimated when categorized as backdoor attacks. Therefore, to evaluate the real attack power of backdoor attacks, we propose a new metric called attack success rate difference (ASRD), which measures the ASR difference between clean-state and poison-state models. Besides, since defenses against stealthy backdoor attacks are absent, we propose Trigger Breaker, consisting of two simple tricks that can defend against stealthy backdoor attacks effectively. Experiments on text classification tasks show that our method achieves significantly better performance than state-of-the-art defense methods against stealthy backdoor attacks.
    CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows. (arXiv:2107.00652v3 [cs.CV] UPDATED)
    (2 min) We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token. To address this issue, we develop the Cross-Shaped Window self-attention mechanism for computing self-attention in the horizontal and vertical stripes in parallel that form a cross-shaped window, with each stripe obtained by splitting the input feature into stripes of equal width. We provide a mathematical analysis of the effect of the stripe width and vary the stripe width for different layers of the Transformer network, which achieves strong modeling capability while limiting the computation cost. We also introduce Locally-enhanced Positional Encoding (LePE), which handles the local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions, and is thus especially effective and friendly for downstream tasks. Incorporated with these designs and a hierarchical structure, CSWin Transformer demonstrates competitive performance on common vision tasks. Specifically, it achieves 85.4% Top-1 accuracy on ImageNet-1K without any extra training data or labels, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 52.2 mIoU on the ADE20K semantic segmentation task, surpassing the previous state-of-the-art Swin Transformer backbone by +1.2, +2.0, +1.4, and +2.0 respectively under a similar FLOPs setting. By further pretraining on the larger dataset ImageNet-21K, we achieve 87.5% Top-1 accuracy on ImageNet-1K and high segmentation performance on ADE20K with 55.7 mIoU. The code and models are available at https://github.com/microsoft/CSWin-Transformer.
    Cluster Regularization via a Hierarchical Feature Regression. (arXiv:2107.04831v2 [stat.ML] UPDATED)
    (2 min) This paper proposes a novel graph-based regularized regression estimator, the hierarchical feature regression (HFR), which mobilizes insights from the domains of machine learning and graph theory to estimate robust parameters for a linear regression. The estimator constructs a supervised feature graph that decomposes parameters along its edges, adjusting first for common variation and successively incorporating idiosyncratic patterns into the fitting process. The graph structure has the effect of shrinking parameters towards group targets, where the extent of shrinkage is governed by a hyperparameter, and group compositions as well as shrinkage targets are determined endogenously. The method offers rich resources for the visual exploration of the latent effect structure in the data, and demonstrates good predictive accuracy and versatility when compared to a panel of commonly used regularization techniques across a range of empirical and simulated regression tasks.
    An Adaptive Neuro-Fuzzy System with Integrated Feature Selection and Rule Extraction for High-Dimensional Classification Problems. (arXiv:2201.03187v1 [cs.LG])
    (2 min) A major limitation of fuzzy or neuro-fuzzy systems is their failure to deal with high-dimensional datasets. This happens primarily due to the use of the T-norm, particularly product or minimum (or a softer version of it). Thus, there is hardly any work dealing with datasets of more than a hundred or so dimensions. Here, we propose a neuro-fuzzy framework that can handle datasets with dimensions even beyond 7000! In this context, we propose an adaptive softmin (Ada-softmin), which effectively overcomes the drawbacks of "numeric underflow" and "fake minimum" that arise for existing fuzzy systems when dealing with high-dimensional problems. We call the result an Adaptive Takagi-Sugeno-Kang (AdaTSK) fuzzy system. We then equip the AdaTSK system to perform feature selection and rule extraction in an integrated manner. In this context, a novel gate function is introduced and embedded only in the consequent parts, which can determine the useful features and rules in two successive phases of learning. Unlike conventional fuzzy rule bases, we design an enhanced fuzzy rule base (En-FRB), which maintains adequate rules but does not grow the number of rules exponentially with dimension, as typically happens for fuzzy neural networks. The integrated Feature Selection and Rule Extraction AdaTSK (FSRE-AdaTSK) system consists of three sequential phases: (i) feature selection, (ii) rule extraction, and (iii) fine-tuning. The effectiveness of FSRE-AdaTSK is demonstrated on 19 datasets, of which five have more than 2000 dimensions, including two with more than 7000 dimensions. This may be the first time fuzzy systems have been realized for classification involving more than 7000 input features.
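    The abstract does not spell out Ada-softmin's exact update rule, so the sketch below uses one plausible soft-minimum form built on a numerically stable log-sum-exp, with the exponent doubled adaptively until the soft value is close to the hard minimum. It also shows the numeric underflow of the product T-norm at 7000 dimensions that motivates the construction; treat the whole block as an assumption-laden illustration, not the paper's operator.

        import numpy as np
        from scipy.special import logsumexp

        def soft_min(x, q):
            """Smooth approximation of min(x); approaches the true min as q grows."""
            x = np.asarray(x, dtype=float)
            return -(logsumexp(-q * x) - np.log(x.size)) / q

        def adaptive_soft_min(x, q0=2.0, tol=1e-3, q_max=1e6):
            """Double q until the soft min is within tol of the hard min."""
            q = q0
            while q < q_max and abs(soft_min(x, q) - np.min(x)) > tol:
                q *= 2.0
            return soft_min(x, q), q

        # 7000 firing strengths: a product T-norm underflows to exactly 0.0
        memberships = np.random.default_rng(0).uniform(0.5, 1.0, size=7000)
        print("product T-norm:", np.prod(memberships))      # 0.0 (underflow)
        val, q = adaptive_soft_min(memberships)
        print("adaptive soft min:", val, "with q =", q)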
    Conditional Approximate Normalizing Flows for Joint Multi-Step Probabilistic Electricity Demand Forecasting. (arXiv:2201.02753v1 [cs.LG])
    (2 min) Some real-world decision-making problems require making probabilistic forecasts over multiple steps at once. However, methods for probabilistic forecasting may fail to capture correlations in the underlying time-series that exist over long time horizons as errors accumulate. One such application is with resource scheduling under uncertainty in a grid environment, which requires forecasting electricity demand that is inherently noisy, but often cyclic. In this paper, we introduce the conditional approximate normalizing flow (CANF) to make probabilistic multi-step time-series forecasts when correlations are present over long time horizons. We first demonstrate our method's efficacy on estimating the density of a toy distribution, finding that CANF improves the KL divergence by one-third compared to that of a Gaussian mixture model while still being amenable to explicit conditioning. We then use a publicly available household electricity consumption dataset to showcase the effectiveness of CANF on joint probabilistic multi-step forecasting. Empirical results show that conditional approximate normalizing flows outperform other methods in terms of multi-step forecast accuracy and lead to up to 10x better scheduling decisions. Our implementation is available at https://github.com/sisl/JointDemandForecasting.
    FedDTG:Federated Data-Free Knowledge Distillation via Three-Player Generative Adversarial Networks. (arXiv:2201.03169v1 [cs.LG])
    (2 min) Applying knowledge distillation to personalized cross-silo federated learning can effectively alleviate the problem of user heterogeneity. This approach, however, requires a proxy dataset, which is difficult to obtain in the real world. Moreover, a global model based on parameter averaging can leak user privacy. We introduce a distributed three-player GAN to implement data-free co-distillation between clients. This technique mitigates the user heterogeneity problem and better protects user privacy. We confirm that the fake samples generated by the GAN can make federated distillation more efficient and robust, and that co-distillation can achieve good performance for individual clients while acquiring global knowledge. Our extensive experiments on benchmark datasets demonstrate the superior generalization performance of the proposed methods compared with the state of the art.
    RARA: Zero-shot Sim2Real Visual Navigation with Following Foreground Cues. (arXiv:2201.02798v1 [cs.CV])
    (2 min) The gap between simulation and the real world restrains many machine learning breakthroughs in computer vision and reinforcement learning from being applicable in the real world. In this work, we tackle this gap for the specific case of camera-based navigation, formulating it as following a visual cue in the foreground against arbitrary backgrounds. The visual cue in the foreground can often be simulated realistically, such as a line, gate or cone. The challenge then lies in coping with the unknown backgrounds and integrating both. As such, the goal is to train a visual agent on data captured in a simulated environment that is empty except for this foreground cue, and to test the model directly in a visually diverse real world. In order to bridge this large gap, we show it is crucial to combine the following techniques: randomized augmentation of the foreground and background, regularization with both deep supervision and a triplet loss, and abstraction of the dynamics by using waypoints rather than direct velocity commands. The various techniques are ablated in our experimental results, both qualitatively and quantitatively, finally demonstrating a successful transfer from simulation to the real world.
    Specializing Versatile Skill Libraries using Local Mixture of Experts. (arXiv:2112.04216v2 [cs.LG] UPDATED)
    (2 min) A long-cherished vision in robotics is to equip robots with skills that match the versatility and precision of humans. For example, when playing table tennis, a robot should be capable of returning the ball in various ways while precisely placing it at the desired location. A common approach to modeling such versatile behavior is to use a Mixture of Experts (MoE) model, where each expert is a contextual motion primitive. However, learning such MoEs is challenging as most objectives force the model to cover the entire context space, which prevents specialization of the primitives, resulting in rather low-quality components. Starting from maximum entropy reinforcement learning (RL), we decompose the objective into optimizing an individual lower bound per mixture component. Further, we introduce a curriculum by allowing the components to focus on a local context region, enabling the model to learn highly accurate skill representations. To this end, we use local context distributions that are adapted jointly with the expert primitives. Our lower bound advocates an iterative addition of new components, where new components will concentrate on local context regions not covered by the current MoE. This local and incremental learning results in a modular MoE model of high accuracy and versatility, where both properties can be scaled by adding more components on the fly. We demonstrate this with an extensive ablation and on two challenging simulated robot skill learning tasks. We compare our achieved performance to LaDiPS and HiREPS, known hierarchical policy search methods for learning diverse skills.
    Lifted Model Checking for Relational MDPs. (arXiv:2106.11735v2 [cs.LG] UPDATED)
    (0 min) Probabilistic model checking has been developed for verifying systems that have stochastic and nondeterministic behavior. Given a probabilistic system, a probabilistic model checker takes a property and checks whether or not the property holds in that system. For this reason, probabilistic model checking provides rigorous guarantees. So far, however, probabilistic model checking has focused on propositional models where a state is represented by a symbol. On the other hand, it is commonly required to make relational abstractions in planning and reinforcement learning. Various frameworks handle relational domains, for instance, STRIPS planning and relational Markov Decision Processes. Using propositional model checking in relational settings requires one to ground the model, which leads to the well-known state explosion problem and intractability. We present pCTL-REBEL, a lifted model checking approach for verifying pCTL properties of relational MDPs. It extends REBEL, a relational model-based reinforcement learning technique, toward relational pCTL model checking. pCTL-REBEL is lifted, which means that rather than grounding, the model exploits symmetries to reason about a group of objects as a whole at the relational level. Theoretically, we show that pCTL model checking is decidable for relational MDPs that have a possibly infinite domain, provided that the states have a bounded size. Practically, we contribute algorithms and an implementation of lifted relational model checking, and we show that the lifted approach improves the scalability of model checking.
    GECKO: Reconciling Privacy, Accuracy and Efficiency in Embedded Deep Learning. (arXiv:2010.00912v3 [cs.CR] UPDATED)
    (0 min) Embedded systems demand on-device processing of data using Neural Networks (NNs) while conforming to memory, power and computation constraints, leading to an efficiency-accuracy tradeoff. To bring NNs to edge devices, several optimizations such as model compression through pruning, quantization, and off-the-shelf architectures with efficient design have been extensively adopted. When deployed in sensitive real-world applications, these algorithms are required to resist inference attacks in order to protect the privacy of users' training data. However, resistance against inference attacks is not accounted for when designing NN models for IoT. In this work, we analyse the three-dimensional privacy-accuracy-efficiency tradeoff in NNs for IoT devices and propose the Gecko training methodology, where we explicitly add resistance to private inferences as a design objective. We optimize the inference-time memory, computation, and power constraints of embedded devices as criteria for designing the NN architecture while also preserving privacy. We choose quantization as the design choice for highly efficient and private models. This choice is driven by the observation that compressed models leak more information compared to baseline models, while off-the-shelf efficient architectures indicate a poor efficiency-privacy tradeoff. We show that models trained using the Gecko methodology are comparable to prior defences against black-box membership attacks in terms of accuracy and privacy while providing efficiency.
    Introduction to Multi-Armed Bandits. (arXiv:1904.07272v7 [cs.LG] UPDATED)
    (2 min) Multi-armed bandits are a simple but very powerful framework for algorithms that make decisions over time under uncertainty. An enormous body of work has accumulated over the years, covered in several books and surveys. This book provides a more introductory, textbook-like treatment of the subject. Each chapter tackles a particular line of work, providing a self-contained, teachable technical introduction and a brief review of further developments; many of the chapters conclude with exercises. The book is structured as follows. The first four chapters are on IID rewards, from the basic model to impossibility results to Bayesian priors to Lipschitz rewards. The next three chapters cover adversarial rewards, from the full-feedback version to adversarial bandits to extensions with linear rewards and combinatorially structured actions. Chapter 8 is on contextual bandits, a middle ground between IID and adversarial bandits in which the change in reward distributions is completely explained by observable contexts. The last three chapters cover connections to economics, from learning in repeated games to bandits with supply/budget constraints to exploration in the presence of incentives. The appendix provides sufficient background on concentration and KL-divergence. The chapters on "bandits with similarity information", "bandits with knapsacks" and "bandits and agents" can also be consumed as standalone surveys on the respective topics.
    Glance and Focus Networks for Dynamic Visual Recognition. (arXiv:2201.03014v1 [cs.CV])
    (2 min) Spatial redundancy widely exists in visual recognition tasks, i.e., discriminative features in an image or video frame usually correspond to only a subset of pixels, while the remaining regions are irrelevant to the task at hand. Therefore, static models which process all the pixels with an equal amount of computation result in considerable redundancy in terms of time and space consumption. In this paper, we formulate the image recognition problem as a sequential coarse-to-fine feature learning process, mimicking the human visual system. Specifically, the proposed Glance and Focus Network (GFNet) first extracts a quick global representation of the input image at a low resolution scale, and then strategically attends to a series of salient (small) regions to learn finer features. The sequential process naturally facilitates adaptive inference at test time, as it can be terminated once the model is sufficiently confident about its prediction, avoiding further redundant computation. It is worth noting that the problem of locating discriminant regions in our model is formulated as a reinforcement learning task, thus requiring no additional manual annotations other than classification labels. GFNet is general and flexible as it is compatible with any off-the-shelf backbone models (such as MobileNets, EfficientNets and TSM), which can be conveniently deployed as the feature extractor. Extensive experiments on a variety of image classification and video recognition tasks and with various backbone models demonstrate the remarkable efficiency of our method. For example, it reduces the average latency of the highly efficient MobileNet-V3 on an iPhone XS Max by 1.3x without sacrificing accuracy. Code and pre-trained models are available at https://github.com/blackfeather-wang/GFNet-Pytorch.
    AGMI: Attention-Guided Multi-omics Integration for Drug Response Prediction with Graph Neural Networks. (arXiv:2112.08366v2 [q-bio.GN] UPDATED)
    (2 min) Accurate drug response prediction (DRP) is a crucial yet challenging task in precision medicine. This paper presents a novel Attention-Guided Multi-omics Integration (AGMI) approach for DRP, which first constructs a Multi-edge Graph (MeG) for each cell line, and then aggregates multi-omics features to predict drug response using a novel structure, called Graph edge-aware Network (GeNet). For the first time, our AGMI approach explores gene constraint based multi-omics integration for DRP with the whole-genome using GNNs. Empirical experiments on the CCLE and GDSC datasets show that our AGMI largely outperforms state-of-the-art DRP methods by 8.3%--34.2% on four metrics. Our data and code are available at https://github.com/yivan-WYYGDSG/AGMI.
    Fair and efficient contribution valuation for vertical federated learning. (arXiv:2201.02658v1 [cs.LG])
    (2 min) Federated learning is a popular technology for training machine learning models on distributed data sources without sharing data. Vertical federated learning or feature-based federated learning applies to the cases that different data sources share the same sample ID space but differ in feature space. To ensure the data owners' long-term engagement, it is critical to objectively assess the contribution from each data source and recompense them accordingly. The Shapley value (SV) is a provably fair contribution valuation metric originated from cooperative game theory. However, computing the SV requires extensively retraining the model on each subset of data sources, which causes prohibitively high communication costs in federated learning. We propose a contribution valuation metric called vertical federated Shapley value (VerFedSV) based on SV. We show that VerFedSV not only satisfies many desirable properties for fairness but is also efficient to compute, and can be adapted to both synchronous and asynchronous vertical federated learning algorithms. Both theoretical analysis and extensive experimental results verify the fairness, efficiency, and adaptability of VerFedSV.
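    To see why Shapley-value-based valuation needs shortcuts like VerFedSV, here is an exact Shapley computation over all coalitions for a toy three-owner game; the utility numbers are made up, and in a real federated setting each utility call would mean retraining a model on that subset of data sources.

        from itertools import combinations
        from math import factorial

        def shapley_values(players, utility):
            """Exact Shapley values; evaluates utility() on every coalition."""
            n = len(players)
            values = {}
            for p in players:
                others = [q for q in players if q != p]
                total = 0.0
                for r in range(n):
                    for coalition in combinations(others, r):
                        weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                        with_p = utility(set(coalition) | {p})
                        total += weight * (with_p - utility(set(coalition)))
                values[p] = total
            return values

        # toy utility: model quality contributed by each data owner (hypothetical)
        gains = {"A": 0.30, "B": 0.20, "C": 0.05}

        def utility(coalition):
            u = sum(gains[p] for p in coalition)
            if {"A", "B"} <= coalition:     # A and B have complementary features
                u += 0.10
            return u

        print(shapley_values(["A", "B", "C"], utility))  # A and B split the synergy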
    Data-Efficient Information Extraction from Form-Like Documents. (arXiv:2201.02647v1 [cs.LG])
    (2 min) Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare. The key challenge is that form-like documents in these business workflows can be laid out in virtually infinitely many ways; hence, a good solution to this problem should generalize to documents with unseen layouts and languages. A solution to this problem requires a holistic understanding of both the textual segments and the visual cues within a document, which is non-trivial. While the natural language processing and computer vision communities are starting to tackle this problem, there has not been much focus on (1) data-efficiency, and (2) ability to generalize across different document types and languages. In this paper, we show that when we have only a small number of labeled documents for training (~50), a straightforward transfer learning approach from a considerably structurally-different larger labeled corpus yields up to a 27 F1 point improvement over simply training on the small corpus in the target domain. We improve on this with a simple multi-domain transfer learning approach, that is currently in production use, and show that this yields up to a further 8 F1 point improvement. We make the case that data efficiency is critical to enable information extraction systems to scale to handle hundreds of different document-types, and learning good representations is critical to accomplishing this.
    Hyperspectral Image Denoising Using Non-convex Local Low-rank and Sparse Separation with Spatial-Spectral Total Variation Regularization. (arXiv:2201.02812v1 [eess.IV])
    (2 min) In this paper, we propose a novel nonconvex approach to robust principal component analysis for HSI denoising, which focuses on simultaneously developing more accurate approximations to both rank and column-wise sparsity for the low-rank and sparse components, respectively. In particular, the new method adopts the log-determinant rank approximation and a novel $\ell_{2,\log}$ norm, to restrict the local low-rank or column-wisely sparse properties for the component matrices, respectively. For the $\ell_{2,\log}$-regularized shrinkage problem, we develop an efficient, closed-form solution, which is named $\ell_{2,\log}$-shrinkage operator. The new regularization and the corresponding operator can be generally used in other problems that require column-wise sparsity. Moreover, we impose the spatial-spectral total variation regularization in the log-based nonconvex RPCA model, which enhances the global piece-wise smoothness and spectral consistency from the spatial and spectral views in the recovered HSI. Extensive experiments on both simulated and real HSIs demonstrate the effectiveness of the proposed method in denoising HSIs.
    Weak Supervision for Affordable Modeling of Electrocardiogram Data. (arXiv:2201.02936v1 [eess.SP])
    (2 min) Analysing electrocardiograms (ECGs) is an inexpensive and non-invasive, yet powerful way to diagnose heart disease. ECG studies using Machine Learning to automatically detect abnormal heartbeats so far depend on large, manually annotated datasets. While collecting vast amounts of unlabeled data can be straightforward, the point-by-point annotation of abnormal heartbeats is tedious and expensive. We explore the use of multiple weak supervision sources to learn diagnostic models of abnormal heartbeats via human designed heuristics, without using ground truth labels on individual data points. Our work is among the first to define weak supervision sources directly on time series data. Results show that with as few as six intuitive time series heuristics, we are able to infer high quality probabilistic label estimates for over 100,000 heartbeats with little human effort, and use the estimated labels to train competitive classifiers evaluated on held out test data.
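    A minimal sketch of the weak-supervision recipe on synthetic "heartbeats": a few heuristic labeling functions vote (or abstain) on each beat, and the votes are combined by majority vote into training labels. The heuristics, thresholds, and data below are illustrative assumptions, not the paper's six heuristics or a real label model.

        import numpy as np

        rng = np.random.default_rng(0)
        normal = rng.normal(0.0, 0.3, size=(1000, 400))      # low-variance beats
        abnormal = rng.normal(0.0, 1.0, size=(1000, 400))    # high-variance beats
        beats = np.vstack([normal, abnormal])

        ABSTAIN, NORMAL, ABNORMAL = -1, 0, 1

        def lf_amplitude(beat):   # unusually large peaks suggest an abnormal beat
            return ABNORMAL if np.max(np.abs(beat)) > 2.0 else ABSTAIN

        def lf_energy(beat):      # high total energy suggests an abnormal beat
            return ABNORMAL if np.sum(beat ** 2) > 200.0 else ABSTAIN

        def lf_smooth(beat):      # small sample-to-sample change suggests normal
            return NORMAL if np.mean(np.abs(np.diff(beat))) < 0.5 else ABSTAIN

        def majority_vote(beat, lfs):
            votes = [v for v in (lf(beat) for lf in lfs) if v != ABSTAIN]
            return ABSTAIN if not votes else int(round(float(np.mean(votes))))

        lfs = [lf_amplitude, lf_energy, lf_smooth]
        labels = np.array([majority_vote(b, lfs) for b in beats])
        truth = np.array([NORMAL] * 1000 + [ABNORMAL] * 1000)
        covered = labels != ABSTAIN
        print("coverage:", covered.mean())
        print("accuracy on covered beats:", (labels[covered] == truth[covered]).mean())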
    Attention-based Random Forest and Contamination Model. (arXiv:2201.02880v1 [cs.LG])
    (2 min) A new approach called ABRF (the attention-based random forest) and its modifications for applying the attention mechanism to the random forest (RF) for regression and classification are proposed. The main idea behind the proposed ABRF models is to assign attention weights with trainable parameters to decision trees in a specific way. The weights depend on the distance between an instance, which falls into a corresponding leaf of a tree, and the instances which fall into the same leaf. This idea stems from the representation of Nadaraya-Watson kernel regression in the form of an RF. Three modifications of the general approach are proposed. The first one is based on applying Huber's contamination model and computing the attention weights by solving quadratic or linear optimization problems. The second and third modifications use gradient-based algorithms for computing the trainable parameters. Numerical experiments with various regression and classification datasets illustrate the proposed method.
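    The Nadaraya-Watson connection can be made concrete: a prediction is a softmax-weighted (attention-weighted) average of training targets, with weights decaying in the distance to the query. The sketch below is the classical kernel smoother that motivates ABRF, not the ABRF model itself; the data and temperature are assumptions.

        import numpy as np

        def nadaraya_watson(x_query, x_train, y_train, tau=1.0):
            """Prediction as an attention-weighted average of training targets."""
            dists = np.sum((x_train - x_query) ** 2, axis=1)
            logits = -dists / tau                 # closer instances weigh more
            weights = np.exp(logits - logits.max())
            weights /= weights.sum()              # softmax attention weights
            return weights @ y_train

        rng = np.random.default_rng(0)
        x_train = rng.uniform(-3, 3, size=(200, 1))
        y_train = np.sin(x_train[:, 0]) + 0.1 * rng.normal(size=200)

        print("prediction at x=1:",
              nadaraya_watson(np.array([1.0]), x_train, y_train, tau=0.2))
        print("true value:", np.sin(1.0))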
    Posture Prediction for Healthy Sitting using a Smart Chair. (arXiv:2201.02615v1 [cs.LG])
    (2 min) Poor sitting habits have been identified as a risk factor for musculoskeletal disorders and lower back pain, especially among the elderly, disabled people, and office workers. In the current computerized world, even while involved in leisure or work activities, people tend to spend most of their day sitting at computer desks. This can result in spinal pain and related problems. Therefore, a means to remind people about their sitting habits and provide recommendations to counterbalance them, such as physical exercise, is important. Posture recognition for seated postures has not received enough attention, as most works focus on standing postures. Wearable sensors, pressure or force sensors, videos and images have been used for posture recognition in the literature. The aim of this study is to build Machine Learning models for classifying the sitting posture of a person by analyzing data collected from a chair fitted with two 32-by-32 pressure sensor arrays at its seat and backrest. Models were built using five algorithms: Random Forest (RF), Gaussian Naïve Bayes, Logistic Regression, Support Vector Machine and Deep Neural Network (DNN). All the models are evaluated using the K-Fold cross-validation technique. This paper presents experiments conducted using two separate datasets, controlled and realistic, and discusses the results achieved in classifying six sitting postures. Average classification accuracies of 98% and 97% were achieved on the controlled and realistic datasets, respectively.
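    A hedged sketch of the modeling setup with scikit-learn: flatten the two 32-by-32 pressure maps into a 2048-dimensional feature vector, fit a Random Forest, and evaluate with K-Fold cross-validation. The synthetic pressure data below stands in for the chair recordings, which are not public here.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import KFold, cross_val_score

        rng = np.random.default_rng(0)
        n_per_class, n_classes = 100, 6

        # hypothetical pressure maps: seat + backrest, each 32x32, flattened to 2048
        X, y = [], []
        for c in range(n_classes):
            center = rng.normal(size=2048)          # class-specific pressure pattern
            X.append(center + 0.5 * rng.normal(size=(n_per_class, 2048)))
            y.append(np.full(n_per_class, c))
        X, y = np.vstack(X), np.concatenate(y)

        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        cv = KFold(n_splits=5, shuffle=True, random_state=0)
        scores = cross_val_score(clf, X, y, cv=cv)
        print("k-fold accuracies:", np.round(scores, 3), "mean:", round(scores.mean(), 3))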
    Clustering Text Using Attention. (arXiv:2201.02816v1 [cs.CL])
    (2 min) Clustering text has been an important problem in the domain of Natural Language Processing. While there are techniques to cluster text by applying conventional clustering algorithms on top of contextual or non-contextual vector space representations, it remains an active area of research, open to improvements in the performance and implementation of these techniques. This paper discusses a novel technique for clustering text using attention mechanisms. Attention mechanisms have proven to be highly effective in various NLP tasks in recent times. This paper extends the idea of the attention mechanism to the clustering space and sheds some light on a whole new area of research.
    Causal Discovery from Sparse Time-Series Data Using Echo State Network. (arXiv:2201.02933v1 [cs.LG])
    (2 min) Causal discovery between collections of time-series data can help diagnose causes of symptoms and hopefully prevent faults before they occur. However, reliable causal discovery can be very challenging, especially when the data acquisition rate varies (i.e., non-uniform data sampling), or in the presence of missing data points (e.g., sparse data sampling). To address these issues, we propose a new system comprised of two parts: the first part fills missing data with Gaussian Process Regression, and the second part leverages an Echo State Network, which is a type of reservoir computer (i.e., used for chaotic system modeling), for causal discovery. We evaluate the performance of our proposed system against three other off-the-shelf causal discovery algorithms, namely structural expectation-maximization, sub-sampled linear auto-regression absolute coefficients, and multivariate Granger Causality with a vector auto-regressive model, using the Tennessee Eastman chemical dataset; we report their corresponding Matthews Correlation Coefficient (MCC) and Receiver Operating Characteristic (ROC) curves, and show that the proposed system outperforms existing algorithms, demonstrating the viability of our approach for discovering causal relationships in a complex system with missing entries.
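    The first stage, filling missing points with Gaussian Process Regression, is easy to sketch with scikit-learn; the kernel choice, sampling rate, and toy signal below are assumptions, and the Echo State Network stage is omitted.

        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF, WhiteKernel

        rng = np.random.default_rng(0)
        t = np.linspace(0, 10, 200)
        x = np.sin(t) + 0.05 * rng.normal(size=t.size)

        mask = rng.random(t.size) < 0.3              # keep only ~30% of the points
        gpr = GaussianProcessRegressor(
            kernel=RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-2),
            normalize_y=True)
        gpr.fit(t[mask].reshape(-1, 1), x[mask])

        x_filled, std = gpr.predict(t.reshape(-1, 1), return_std=True)
        print("max imputation error vs. clean signal:",
              np.max(np.abs(x_filled - np.sin(t))).round(3))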
    Reconfigurable Intelligent Surface Enabled Spatial Multiplexing with Fully Convolutional Network. (arXiv:2201.02834v1 [eess.SP])
    (2 min) Reconfigurable intelligent surface (RIS) is an emerging technology for future wireless communication systems. In this work, we consider downlink spatial multiplexing enabled by the RIS for weighted sum-rate (WSR) maximization. In the literature, most solutions use alternating gradient-based optimization, which has moderate performance, high complexity, and limited scalability. We propose to apply a fully convolutional network (FCN) to solve this problem, which was originally designed for semantic segmentation of images. The rectangular shape of the RIS and the spatial correlation of channels with adjacent RIS antennas due to the short distance between them encourage us to apply it for the RIS configuration. We design a set of channel features that includes both cascaded channels via the RIS and the direct channel. In the base station (BS), the differentiable minimum mean squared error (MMSE) precoder is used for pretraining and the weighted minimum mean squared error (WMMSE) precoder is then applied for fine-tuning, which is nondifferentiable, more complex, but achieves a better performance. Evaluation results show that the proposed solution has higher performance and allows for a faster evaluation than the baselines. Hence it scales better to a large number of antennas, advancing the RIS one step closer to practical deployment.
    Cognitive Computing to Optimize IT Services. (arXiv:2201.02737v1 [cs.CL])
    (2 min) In this paper, the challenges of maintaining a healthy IT operational environment are addressed by proactively analyzing IT service desk tickets, customer satisfaction surveys, and social media data. A cognitive solution goes beyond traditional structured data analysis by deeply analyzing both structured and unstructured text. The salient features of the proposed platform include language identification, translation, hierarchical extraction of the most frequently occurring topics, entities and their relationships, text summarization, sentiments, and knowledge extraction from unstructured text using Natural Language Processing techniques. Moreover, the insights from unstructured text, combined with structured data, allow the development of various classification, segmentation, and time-series forecasting use cases on incident, problem, and change datasets. Further, the text and predictive insights, together with raw data, are used for visualization and exploration of actionable insights on a rich and interactive dashboard. These insights are not only hard to find with traditional structured data analysis; discovering them can also take a very long time, especially when dealing with a massive amount of unstructured data. By acting on these insights, organizations can benefit from a significant reduction in ticket volume, reduced operational costs, and increased customer satisfaction. In various experiments, on average, up to 18-25% of yearly ticket volume has been reduced using the proposed approach.
    A Deep Learning Approach to Integrate Human-Level Understanding in a Chatbot. (arXiv:2201.02735v1 [cs.CL])
    (2 min) In recent times, a large number of people have been involved in establishing their own businesses. Unlike humans, chatbots can serve multiple customers at a time, are available 24/7, and reply within a fraction of a second. Though chatbots perform well in task-oriented activities, in most cases they fail to understand personalized opinions, statements, or even queries, which later impacts the organization through poor service management. A lack of understanding capability in bots discourages humans from continuing conversations with them. Usually, chatbots give absurd responses when they are unable to interpret a user's text accurately. By extracting client reviews from conversations through chatbots, organizations can close the major gap of understanding between users and the chatbot and improve the quality of their products and services. Thus, in our research we incorporated all the key elements that are necessary for a chatbot to analyse and understand an input text precisely and accurately. We performed sentiment analysis, emotion detection, intent classification, and named-entity recognition using deep learning to develop chatbots with humanistic understanding and intelligence. The efficiency of our approach is demonstrated by a detailed analysis.
    Improved Input Reprogramming for GAN Conditioning. (arXiv:2201.02692v1 [cs.LG])
    (2 min) We study the GAN conditioning problem, whose goal is to convert a pretrained unconditional GAN into a conditional GAN using labeled data. We first identify and analyze three approaches to this problem -- conditional GAN training from scratch, fine-tuning, and input reprogramming. Our analysis reveals that when the amount of labeled data is small, input reprogramming performs the best. Motivated by real-world scenarios with scarce labeled data, we focus on the input reprogramming approach and carefully analyze the existing algorithm. After identifying a few critical issues of the previous input reprogramming approach, we propose a new algorithm called InRep+. Our algorithm InRep+ addresses the existing issues through novel uses of invertible neural networks and Positive-Unlabeled (PU) learning. Via extensive experiments, we show that InRep+ outperforms all existing methods, particularly when label information is scarce, noisy, and/or imbalanced. For instance, for the task of conditioning a CIFAR10 GAN with 1% labeled data, InRep+ achieves an average Intra-FID of 82.13, whereas the second-best method achieves 114.51.
    Block Walsh-Hadamard Transform Based Binary Layers in Deep Neural Networks. (arXiv:2201.02711v1 [cs.LG])
    (2 min) Convolution has been the core operation of modern deep neural networks. It is well known that convolutions can be implemented in the Fourier transform domain. In this paper, we propose to use the binary block Walsh-Hadamard transform (WHT) instead of the Fourier transform. We use WHT-based binary layers to replace some of the regular convolution layers in deep neural networks. We utilize both one-dimensional (1-D) and two-dimensional (2-D) binary WHTs in this paper. In both 1-D and 2-D layers, we compute the binary WHT of the input feature map and denoise the WHT-domain coefficients using a nonlinearity obtained by combining soft-thresholding with the tanh function. After denoising, we compute the inverse WHT. We use 1D-WHT layers to replace the $1\times 1$ convolutional layers, while 2D-WHT layers can replace the 3$\times$3 convolution layers and Squeeze-and-Excite layers. 2D-WHT layers with trainable weights can also be inserted before the Global Average Pooling (GAP) layers to assist the dense layers. In this way, we can reduce the number of trainable parameters significantly with only a slight decrease in accuracy. In this paper, we implement the WHT layers in MobileNet-V2, MobileNet-V3-Large, and ResNet to reduce the number of parameters significantly with negligible accuracy loss. Moreover, according to our speed test, the 2D-FWHT layer runs about 24 times as fast as the regular $3\times 3$ convolution with 19.51\% less RAM usage in an NVIDIA Jetson Nano experiment.
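    For intuition, here is a small NumPy sketch of the core pipeline: a fast Walsh-Hadamard transform, a denoising nonlinearity, and the inverse transform (for the WHT, the inverse is the forward transform divided by the length). The exact combination of soft-thresholding and tanh is our guess from the abstract.

        import numpy as np

        def fwht(x):
            # Fast Walsh-Hadamard transform; len(x) must be a power of 2.
            x = x.copy()
            h = 1
            while h < len(x):
                for i in range(0, len(x), 2 * h):
                    for j in range(i, i + h):
                        a, b = x[j], x[j + h]
                        x[j], x[j + h] = a + b, a - b
                h *= 2
            return x

        def denoise(y, T=0.1):
            # Assumed nonlinearity: soft-thresholding composed with tanh.
            return np.tanh(np.sign(y) * np.maximum(np.abs(y) - T, 0.0))

        v = np.random.randn(8)
        recon = fwht(denoise(fwht(v))) / len(v)   # inverse WHT = WHT / N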
    Assessing Policy, Loss and Planning Combinations in Reinforcement Learning using a New Modular Architecture. (arXiv:2201.02874v1 [cs.LG])
    (2 min) The model-based reinforcement learning paradigm, which uses planning algorithms and neural network models, has recently achieved unprecedented results in diverse applications, leading to what is now known as deep reinforcement learning. These agents are quite complex and involve multiple components, factors that can create challenges for research. In this work, we propose a new modular software architecture suited to these types of agents, and a set of building blocks that can be easily reused and assembled to construct new model-based reinforcement learning agents. These building blocks include planning algorithms, policies, and loss functions. We illustrate the use of this architecture by combining several of these building blocks to implement and test agents that are optimized for three different test environments: Cartpole, Minigrid, and Tictactoe. One particular planning algorithm, made available in our implementation and not previously used in reinforcement learning, which we call averaged minimax, achieved good results in the three tested environments. Experiments performed with this architecture have shown that the best combination of planning algorithm, policy, and loss function is heavily problem-dependent. This result provides evidence that the proposed architecture, which is modular and reusable, is useful for reinforcement learning researchers who want to study new environments and techniques.
    Bitcoin Price Predictive Modeling Using Expert Correction. (arXiv:2201.02729v1 [q-fin.ST])
    (2 min) The paper studies a linear model for the Bitcoin price which includes regression features based on Bitcoin currency statistics, mining processes, Google search trends, and Wikipedia page visits. The pattern of deviation of the regression model's predictions from real prices is simpler than the price time series itself. It is assumed that this pattern can be predicted by an experienced expert. In this way, using the combination of the regression model and expert correction, one can obtain better results than with either the regression model or expert opinion alone. It is shown that a Bayesian approach makes it possible to use distributions with fat tails and thereby take into account the outliers in the Bitcoin price time series.
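    A toy sketch of the two-stage idea (our own illustration; the feature names and the correction rule are stand-ins): fit the linear model, inspect the residual pattern, and add an expert-style adjustment on top of the regression forecast.

        import numpy as np
        from sklearn.linear_model import LinearRegression

        X = np.random.randn(300, 4)      # stand-ins for currency/mining/search/wiki features
        y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + np.random.randn(300)

        model = LinearRegression().fit(X, y)
        residual = y - model.predict(X)  # the deviation pattern an expert would study
        expert_adjust = 0.5 * residual   # hypothetical expert correction (simulated here)
        y_hat = model.predict(X) + expert_adjust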
    Open-Set Recognition of Breast Cancer Treatments. (arXiv:2201.02923v1 [cs.LG])
    (2 min) Open-set recognition generalizes a classification task by classifying test samples as one of the known classes from training or as "unknown." As novel cancer drug cocktails with improved treatment are continually discovered, predicting cancer treatments can naturally be formulated as an open-set recognition problem. Straightforward implementations of prior work on open-set learning in healthcare suffer drawbacks from modeling unknown samples during training. Accordingly, we reframe the problem methodology and apply a recent Gaussian mixture variational autoencoder model, which achieves state-of-the-art results on image datasets, to breast cancer patient data. Not only do we obtain more accurate and robust classification results, with a 24.5% average F1 increase compared to a recent method, but we also reexamine open-set recognition in terms of deployability to a clinical setting.
    Spatio-Temporal Graph Representation Learning for Fraudster Group Detection. (arXiv:2201.02621v1 [cs.LG])
    (2 min) Motivated by potential financial gain, companies may hire fraudster groups to write fake reviews to either demote competitors or promote their own businesses. Such groups are considerably more successful in misleading customers, as people are more likely to be influenced by the opinion of a large group. To detect such groups, a common approach is to model fraudster groups as static networks, consequently overlooking the longitudinal behavior of reviewers and thus the dynamics of co-review relations among reviewers in a group. Hence, these approaches are incapable of excluding outlier reviewers, i.e., fraudsters intentionally camouflaging themselves in a group and genuine reviewers who happen to co-review alongside fraudster groups. To address this issue, we capitalize on the effectiveness of the HIN-RNN in learning reviewer representations while capturing the collaboration between reviewers: we first utilize the HIN-RNN to model the co-review relations of reviewers in a group within a fixed time window of 28 days. We refer to this as spatial relation learning, to signify the generalisability of this work to other networked scenarios. Then we use an RNN on the spatial relations to predict the spatio-temporal relations of reviewers in the group. In the third step, a Graph Convolutional Network (GCN) refines the reviewers' vector representations using these predicted relations. These refined representations are then used to remove outlier reviewers. The average of the remaining reviewers' representations is then fed to a simple fully connected layer to predict whether the group is a fraudster group. Exhaustive experiments with the proposed approach showed a 5% (4%), 12% (5%), and 12% (5%) improvement over three of the most recent approaches in precision, recall, and F1-value on the Yelp (Amazon) dataset, respectively.
    Optimal 1-Wasserstein Distance for WGANs. (arXiv:2201.02824v1 [stat.ML])
    (2 min) The mathematical forces at work behind Generative Adversarial Networks raise challenging theoretical issues. Motivated by the important question of characterizing the geometrical properties of the generated distributions, we provide a thorough analysis of Wasserstein GANs (WGANs) in both the finite sample and asymptotic regimes. We study the specific case where the latent space is univariate and derive results valid regardless of the dimension of the output space. We show in particular that for a fixed sample size, the optimal WGANs are closely linked with connected paths minimizing the sum of the squared Euclidean distances between the sample points. We also highlight the fact that WGANs are able to approach (for the 1-Wasserstein distance) the target distribution as the sample size tends to infinity, at a given convergence rate and provided the family of generative Lipschitz functions grows appropriately. We derive in passing new results on optimal transport theory in the semi-discrete setting.
    Attacking Vertical Collaborative Learning System Using Adversarial Dominating Inputs. (arXiv:2201.02775v1 [cs.CR])
    (2 min) The vertical collaborative learning system, also known as the vertical federated learning (VFL) system, has recently become prominent as a concept for processing data distributed across many individual sources without the need to centralize it. Multiple participants collaboratively train models based on their local data in a privacy-preserving manner. To date, VFL has become a de facto solution for securely learning a model among organizations, allowing knowledge to be shared without compromising the privacy of any individual organization. Despite the prosperous development of VFL systems, we find that certain inputs of a participant, named adversarial dominating inputs (ADIs), can dominate the joint inference towards the direction of the adversary's will and force other (victim) participants to make negligible contributions, losing rewards that are usually offered in proportion to the importance of their contributions in collaborative learning scenarios. We conduct a systematic study on ADIs by first proving their existence in typical VFL systems. We then propose gradient-based methods to synthesize ADIs of various formats and exploit common VFL systems. We further launch greybox fuzz testing, guided by the resiliency score of "victim" participants, to perturb adversary-controlled inputs and systematically explore the VFL attack surface in a privacy-preserving manner. We conduct an in-depth study on the influence of critical parameters and settings in synthesizing ADIs. Our study reveals new VFL attack opportunities, promoting the identification of unknown threats before breaches and the building of more secure VFL systems.
    PocketNN: Integer-only Training and Inference of Neural Networks via Direct Feedback Alignment and Pocket Activations in Pure C++. (arXiv:2201.02863v1 [cs.LG])
    (2 min) Standard deep learning algorithms are implemented using floating-point real numbers. This presents an obstacle to implementing them on low-end devices which may not have dedicated floating-point units (FPUs). As a result, researchers in TinyML have considered machine learning algorithms that can train and run a deep neural network (DNN) on a low-end device using integer operations only. In this paper we propose PocketNN, a light and self-contained proof-of-concept framework in pure C++ for the training and inference of DNNs using only integers. Unlike other approaches, PocketNN directly operates on integers without requiring any explicit quantization algorithms or customized fixed-point formats. This is made possible by pocket activations, a family of activation functions devised for integer-only DNNs, and an emerging DNN training algorithm called direct feedback alignment (DFA). Unlike standard backpropagation (BP), DFA trains each layer independently, thus avoiding integer overflow, which is a key problem when using BP with integer-only operations. We used PocketNN to train DNNs on two well-known datasets, MNIST and Fashion-MNIST. Our experiments show that the DNNs trained with PocketNN achieved 96.98% and 87.7% accuracy on MNIST and Fashion-MNIST, respectively. These accuracies are very close to those of equivalent DNNs trained using BP with floating-point operations; the degradations were just 1.02 and 2.09 percentage points, respectively. Finally, PocketNN has high compatibility and portability for low-end devices as it is open source and implemented in pure C++ without any dependencies.
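    To make the DFA idea concrete, here is a toy NumPy update for a two-layer network (floating point for brevity; PocketNN itself does this with integer-only arithmetic and pocket activations). The output error reaches the hidden layer through a fixed random matrix rather than the transpose of the forward weights:

        import numpy as np

        rng = np.random.default_rng(0)
        W1, W2 = rng.normal(size=(32, 16)), rng.normal(size=(16, 10))
        B1 = rng.normal(size=(10, 16))   # fixed random feedback matrix

        x, y = rng.normal(size=32), np.eye(10)[3]
        a1 = np.tanh(x @ W1)
        e = a1 @ W2 - y                  # output error (linear output layer)
        W2 -= 0.01 * np.outer(a1, e)     # last layer: ordinary delta rule
        d1 = (e @ B1) * (1 - a1**2)      # DFA: error fed back directly via B1
        W1 -= 0.01 * np.outer(x, d1)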
    Optimizing the Communication-Accuracy Trade-off in Federated Learning with Rate-Distortion Theory. (arXiv:2201.02664v1 [cs.LG])
    (2 min) A significant bottleneck in federated learning is the network communication cost of sending model updates from client devices to the central server. We propose a method to reduce this cost. Our method encodes quantized updates with an appropriate universal code, taking into account their empirical distribution. Because quantization introduces error, we select quantization levels by optimizing for the desired trade-off in average total bitrate and gradient distortion. We demonstrate empirically that in spite of the non-i.i.d. nature of federated learning, the rate-distortion frontier is consistent across datasets, optimizers, clients and training rounds, and within each setting, distortion reliably predicts model performance. This allows for a remarkably simple compression scheme that is near-optimal in many use cases, and outperforms Top-K, DRIVE, 3LC and QSGD on the Stack Overflow next-word prediction benchmark.
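    A back-of-the-envelope sketch of the trade-off being optimized (our toy illustration, not the paper's scheme): quantize an update with a given step size, estimate the bitrate from the empirical entropy of the quantized symbols, and measure the resulting distortion.

        import numpy as np

        def empirical_bitrate(symbols):
            # Entropy of the quantized symbols approximates the bits/symbol a
            # universal or entropy code would achieve.
            _, counts = np.unique(symbols, return_counts=True)
            p = counts / counts.sum()
            return -(p * np.log2(p)).sum()

        g = np.random.laplace(size=10_000)        # stand-in for a model update
        for step in (0.1, 0.5, 1.0):              # sweep the rate-distortion curve
            q = np.round(g / step)
            distortion = np.mean((g - q * step) ** 2)
            print(step, empirical_bitrate(q), distortion)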
    Deep Generative Modeling for Volume Reconstruction in Cryo-Electron Microscopy. (arXiv:2201.02867v1 [eess.IV])
    (2 min) Recent breakthroughs in high resolution imaging of biomolecules in solution with cryo-electron microscopy (cryo-EM) have unlocked new doors for the reconstruction of molecular volumes, thereby promising further advances in biology, chemistry, and pharmacological research amongst others. Despite significant headway, the immense challenges in cryo-EM data analysis remain legion and intricately inter-disciplinary in nature, requiring insights from physicists, structural biologists, computer scientists, statisticians, and applied mathematicians. Meanwhile, recent next-generation volume reconstruction algorithms that combine generative modeling with end-to-end unsupervised deep learning techniques have shown promising results on simulated data, but still face considerable hurdles when applied to experimental cryo-EM images. In light of the proliferation of such methods and given the interdisciplinary nature of the task, we propose here a critical review of recent advances in the field of deep generative modeling for high resolution cryo-EM volume reconstruction. The present review aims to (i) compare and contrast these new methods, while (ii) presenting them from a perspective and using terminology familiar to scientists in each of the five aforementioned fields with no specific background in cryo-EM. The review begins with an introduction to the mathematical and computational challenges of deep generative models for cryo-EM volume reconstruction, along with an overview of the baseline methodology shared across this class of algorithms. Having established the common thread weaving through these different models, we provide a practical comparison of these state-of-the-art algorithms, highlighting their relative strengths and weaknesses, along with the assumptions that they rely on. This allows us to identify bottlenecks in current methods and avenues for future research.
    Stay Positive: Knowledge Graph Embedding Without Negative Sampling. (arXiv:2201.02661v1 [cs.LG])
    (2 min) Knowledge graphs (KGs) are typically incomplete and we often wish to infer new facts given the existing ones. This can be thought of as a binary classification problem; we aim to predict if new facts are true or false. Unfortunately, we generally only have positive examples (the known facts) but we also need negative ones to train a classifier. To resolve this, it is usual to generate negative examples using a negative sampling strategy. However, this can produce false negatives which may reduce performance, is computationally expensive, and does not produce calibrated classification probabilities. In this paper, we propose a training procedure that obviates the need for negative sampling by adding a novel regularization term to the loss function. Our results for two relational embedding models (DistMult and SimplE) show the merit of our proposal both in terms of performance and speed.
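    As a sketch, the DistMult triple score below is standard; the regularization term, however, is a placeholder of our own devising, since the abstract does not state the paper's novel regularizer explicitly.

        import torch

        def distmult_score(h, r, t):
            # Standard DistMult score: <h, r, t> = sum_i h_i * r_i * t_i
            return (h * r * t).sum(-1)

        def loss(h, r, t, all_entities, lam=0.01):
            # Positive log-likelihood plus a placeholder penalty on the total
            # score mass over all entities (NOT the paper's exact regularizer).
            pos = -torch.nn.functional.logsigmoid(distmult_score(h, r, t)).mean()
            reg = distmult_score(h.unsqueeze(1), r.unsqueeze(1), all_entities)
            return pos + lam * reg.sigmoid().mean()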
    Testing the Robustness of a BiLSTM-based Structural Story Classifier. (arXiv:2201.02733v1 [cs.CL])
    (2 min) The growing prevalence of counterfeit stories on the internet has fostered significant interest in fast and scalable detection of fake news in the machine learning community. While several machine learning techniques for this purpose have emerged, we observe that there is a need to evaluate the impact of noise on these techniques' performance, where noise constitutes news articles being mistakenly labeled as fake (or real). This work takes a step in that direction: we examine the impact of noise on a state-of-the-art structural model for fake news detection based on a BiLSTM (Bidirectional Long Short-Term Memory), Hierarchical Discourse-level Structure for Fake News Detection by Karimi and Tang (Reference no. 9).
    Extraction of Product Specifications from the Web -- Going Beyond Tables and Lists. (arXiv:2201.02896v1 [cs.IR])
    (2 min) E-commerce product pages on the web often present product specification data in structured tabular blocks. Extraction of these product attribute-value specifications has benefited applications like product catalogue curation, search, question answering, and others. However, across different websites there is a wide variety of HTML elements typically used to render these blocks, which makes their automatic extraction a challenge. Most current research has focused on extracting product specifications from tables and lists and therefore suffers from low recall when applied in a large-scale extraction setting. In this paper, we present a product specification extraction approach that goes beyond tables or lists and generalizes across the diverse HTML elements used for rendering specification blocks. Using a combination of hand-coded features and deep-learned spatial and token features, we first identify the specification blocks on a product page. We then extract the product attribute-value pairs from these blocks following an approach inspired by wrapper induction. We created a labeled dataset of product specifications extracted from 14,111 diverse specification blocks taken from a range of different product websites. Our experiments show the efficacy of our approach compared to current specification extraction models and support our claim about its application to large-scale product specification extraction.
    An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic. (arXiv:2201.02968v1 [cs.LG])
    (2 min) Recently, applications of deep neural networks (DNNs) have become very prominent in many fields, such as computer vision (CV) and natural language processing (NLP), due to their superior feature extraction performance. However, high-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices. Different from the previous cloud/edge-only pattern, which puts huge pressure on uplink communication, and the device-only fashion, which demands unaffordable calculation strength, we highlight collaborative computation between the device and the edge for DNN models, which can achieve a good balance between communication load and execution accuracy. Specifically, a systematic on-demand co-inference framework is proposed to exploit the multi-branch structure, in which the pre-trained AlexNet is right-sized through \emph{early exit} and partitioned at an intermediate DNN layer. Integer quantization is enforced to further compress the transmitted bits. As a result, we establish a new Deep Reinforcement Learning (DRL) optimizer, Soft Actor-Critic for discrete (SAC-d), which generates the \emph{exit point}, \emph{partition point}, and \emph{compressing bits} by soft policy iterations. Based on a latency- and accuracy-aware reward design, such an optimizer can adapt well to complex environments like dynamic wireless channels and arbitrary CPU processing, and is capable of supporting 5G URLLC. Real-world experiments on a Raspberry Pi 4 and a PC show that the proposed solution outperforms the baselines.
    LoMar: A Local Defense Against Poisoning Attack on Federated Learning. (arXiv:2201.02873v1 [cs.LG])
    (2 min) Federated learning (FL) provides a highly efficient decentralized machine learning framework, where the training data remains distributed at remote clients in a network. Though FL enables a privacy-preserving mobile edge computing framework using IoT devices, recent studies have shown that this approach is susceptible to poisoning attacks from the side of remote clients. To address the poisoning attacks on FL, we provide a \textit{two-phase} defense algorithm called {Lo}cal {Ma}licious Facto{r} (LoMar). In phase I, LoMar scores model updates from each remote client by measuring the relative distribution over their neighbors using a kernel density estimation method. In phase II, an optimal threshold is approximated to distinguish malicious and clean updates from a statistical perspective. Comprehensive experiments on four real-world datasets have been conducted, and the experimental results show that our defense strategy can effectively protect the FL system. Specifically, the defense performance on the Amazon dataset under a label-flipping attack indicates that, compared with FG+Krum, LoMar increases the target-label testing accuracy from $96.0\%$ to $98.8\%$, and the overall averaged testing accuracy from $90.1\%$ to $97.0\%$.
    Global Convergence Analysis of Deep Linear Networks with A One-neuron Layer. (arXiv:2201.02761v1 [cs.LG])
    (2 min) In this paper, we follow Eftekhari's work to give a non-local convergence analysis of deep linear networks. Specifically, we consider optimizing deep linear networks that have a layer with one neuron under quadratic loss. We characterize the convergence point of trajectories with arbitrary starting points under gradient flow, including paths that converge to one of the saddle points or to the origin. We also show specific convergence rates for trajectories that converge to the global minimizer, by stages. To achieve these results, this paper mainly extends the machinery in Eftekhari's work to provably identify the rank-stable set and the set of points converging to the global minimizer. We also give specific examples to show the necessity of our definitions. Crucially, as far as we know, our results appear to be the first to give a non-local global analysis of linear neural networks from arbitrarily initialized points, rather than the lazy training regime which has dominated the literature on neural networks, or the restricted benign initializations in Eftekhari's work. We also note that extending our results to general linear networks without the one-hidden-neuron assumption remains a challenging open problem.
    BottleFit: Learning Compressed Representations in Deep Neural Networks for Effective and Efficient Split Computing. (arXiv:2201.02693v1 [cs.LG])
    (2 min) Although mission-critical applications require the use of deep neural networks (DNNs), their continuous execution on mobile devices results in a significant increase in energy consumption. While edge offloading can decrease energy consumption, erratic patterns in channel quality, network load, and edge server load can lead to severe disruption of the system's key operations. An alternative approach, called split computing, generates compressed representations within the model (called "bottlenecks") to reduce bandwidth usage and energy consumption. Prior work has proposed approaches that introduce additional layers, to the detriment of energy consumption and latency. For this reason, we propose a new framework called BottleFit, which, in addition to targeted DNN architecture modifications, includes a novel training strategy to achieve high accuracy even at strong compression rates. We apply BottleFit to cutting-edge DNN models in image classification, and show that BottleFit achieves 77.1% data compression with up to 0.6% accuracy loss on the ImageNet dataset, while state-of-the-art approaches such as SPINN lose up to 6% in accuracy. We experimentally measure the power consumption and latency of an image classification application running on an NVIDIA Jetson Nano board (GPU-based) and a Raspberry PI board (GPU-less). We show that BottleFit decreases power consumption and latency respectively by up to 49% and 89% with respect to (w.r.t.) local computing and by 37% and 55% w.r.t. edge offloading. We also compare BottleFit with state-of-the-art autoencoder-based approaches, and show that (i) BottleFit reduces power consumption and execution time respectively by up to 54% and 44% on the Jetson and 40% and 62% on the Raspberry PI; and (ii) the size of the head model executed on the mobile device is 83 times smaller. The code repository will be published for full reproducibility of the results.
    {\lambda}-Scaled-Attention: A Novel Fast Attention Mechanism for Efficient Modeling of Protein Sequences. (arXiv:2201.02912v1 [cs.LG])
    (2 min) Attention-based deep networks have been successfully applied to textual data in the field of NLP. However, their application to protein sequences poses additional challenges due to the weak semantics of the protein words, unlike plain text words. These unexplored challenges faced by the standard attention technique include (i) the vanishing attention score problem and (ii) high variations in the attention distribution. In this regard, we introduce a novel {\lambda}-scaled attention technique for fast and efficient modeling of protein sequences that addresses both of the above problems. This is used to develop the {\lambda}-scaled attention network, which is evaluated on the task of protein function prediction implemented at the protein sub-sequence level. Experiments on the datasets for biological process (BP) and molecular function (MF) showed significant improvements in the F1 score values for the proposed {\lambda}-scaled attention technique over its counterpart approach based on the standard attention technique (+2.01% for BP and +4.67% for MF) and the state-of-the-art ProtVecGen-Plus approach (+2.61% for BP and +4.20% for MF). Further, fast convergence (converging in half the number of epochs) and efficient learning (in terms of a very low difference between the training and validation losses) were also observed during the training process.
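    A minimal sketch of what such a scaled attention might look like (the placement of the scale is our assumption; the paper's exact formulation is not given in the abstract):

        import torch

        def lambda_scaled_attention(Q, K, V, lam):
            # Scale the raw scores by a learnable factor lam to counter
            # vanishing attention scores on weakly semantic protein "words".
            scores = lam * (Q @ K.transpose(-2, -1)) / K.shape[-1] ** 0.5
            return torch.softmax(scores, dim=-1) @ V

        Q = K = V = torch.randn(1, 12, 32)            # toy sub-sequence embeddings
        lam = torch.nn.Parameter(torch.tensor(2.0))   # learnable scale
        out = lambda_scaled_attention(Q, K, V, lam)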
    Lazy Lagrangians with Predictions for Online Learning. (arXiv:2201.02890v1 [cs.LG])
    (2 min) We consider the general problem of online convex optimization with time-varying additive constraints in the presence of predictions for the next cost and constraint functions. A novel primal-dual algorithm is designed by combining a Follow-The-Regularized-Leader iteration with prediction-adaptive dynamic steps. The algorithm achieves $\mathcal O(T^{\frac{3-\beta}{4}})$ regret and $\mathcal O(T^{\frac{1+\beta}{2}})$ constraint violation bounds that are tunable via the parameter $\beta\!\in\![1/2,1)$ and have constant factors that shrink with the quality of the predictions, eventually achieving $\mathcal O(1)$ regret for perfect predictions. Our work extends the FTRL framework to this constrained OCO setting and outperforms the respective state-of-the-art greedy-based solutions, without imposing conditions on the quality of predictions, the cost functions, or the geometry of constraints, beyond convexity.
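    To make the trade-off concrete (our arithmetic from the stated exponents, not a claim in the abstract): at $\beta = 1/2$ the bounds read $\mathcal O(T^{5/8})$ regret and $\mathcal O(T^{3/4})$ constraint violation, while pushing $\beta$ toward $1$ drives the regret toward $\mathcal O(T^{1/2})$ at the cost of near-linear constraint violation.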
    A Multi-agent Reinforcement Learning Approach for Efficient Client Selection in Federated Learning. (arXiv:2201.02932v1 [cs.LG])
    (2 min) Federated learning (FL) is a training technique that enables client devices to jointly learn a shared model by aggregating locally computed models without exposing their raw data. While most of the existing work focuses on improving the FL model accuracy, in this paper we focus on improving the training efficiency, which is often a hurdle for adopting FL in real-world applications. Specifically, we design an efficient FL framework which jointly optimizes model accuracy, processing latency, and communication efficiency, all of which are primary design considerations for a real implementation of FL. Inspired by the recent success of Multi-Agent Reinforcement Learning (MARL) in solving complex control problems, we present \textit{FedMarl}, an MARL-based FL framework which performs efficient run-time client selection. Experiments show that FedMarl can significantly improve model accuracy with much lower processing latency and communication cost.
    An Improved Mathematical Model of Sepsis: Modeling, Bifurcation Analysis, and Optimal Control Study for Complex Nonlinear Infectious Disease System. (arXiv:2201.02702v1 [math.DS])
    (2 min) Sepsis is a life-threatening medical emergency, a major cause of death worldwide and the second-highest cause of mortality in the United States. Researching optimal control treatment or intervention strategies for the comprehensive sepsis system is key to reducing mortality. For this purpose, this paper first improves a complex nonlinear sepsis model proposed in our previous work. Then, bifurcation analyses are conducted for each sepsis subsystem to study the model behaviors under various system parameters. The bifurcation analysis results further indicate the necessity of control treatment and intervention therapy: without any control, under some parameter and initial-value settings, the system exhibits persistent inflammation over time. Therefore, we develop our improved complex nonlinear sepsis model into a sepsis optimal control model, and use effective biomarkers recommended in existing clinical practice as the optimization objective function measuring the development of sepsis. In addition, a Bayesian optimization algorithm combined with a recurrent neural network (the RNN-BO algorithm) is introduced to predict the optimal control strategy for the studied sepsis optimal control system. What distinguishes the RNN-BO algorithm from other optimization algorithms is that, given any new initial system value setting (the initial value is associated with the initial conditions of patients), it can quickly predict a corresponding time-series optimal control based on historical optimal control data for any new sepsis patient. To demonstrate the effectiveness and efficiency of the RNN-BO algorithm in solving the optimal control problem for the complex nonlinear sepsis system, numerical simulations comparing it with other optimization algorithms are implemented in this paper.
    Unsupervised Machine Learning for Exploratory Data Analysis of Exoplanet Transmission Spectra. (arXiv:2201.02696v1 [astro-ph.EP])
    (2 min) Transit spectroscopy is a powerful tool to decode the chemical composition of the atmospheres of extrasolar planets. In this paper we focus on unsupervised techniques for analyzing spectral data from transiting exoplanets. We demonstrate methods for i) cleaning and validating the data, ii) initial exploratory data analysis based on summary statistics (estimates of location and variability), iii) exploring and quantifying the existing correlations in the data, iv) pre-processing and linearly transforming the data to its principal components, v) dimensionality reduction and manifold learning, vi) clustering and anomaly detection, and vii) visualization and interpretation of the data. To illustrate the proposed unsupervised methodology, we use a well-known public benchmark dataset of synthetic transit spectra. We show that there is a high degree of correlation in the spectral data, which calls for appropriate low-dimensional representations. We explore a number of different techniques for such dimensionality reduction and identify several suitable options in terms of summary statistics, principal components, etc. We uncover interesting structures in the principal component basis, namely well-defined branches corresponding to different chemical regimes of the underlying atmospheres. We demonstrate that those branches can be successfully recovered with a K-means clustering algorithm in a fully unsupervised fashion. We advocate for a three-dimensional representation of the spectroscopic data in terms of the first three principal components, in order to reveal the existing structure in the data and quickly characterize the chemical class of a planet.
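    Steps iv)-vi) map directly onto standard tooling; here is a minimal scikit-learn sketch with synthetic stand-in data (the component and cluster counts are assumptions):

        import numpy as np
        from sklearn.preprocessing import StandardScaler
        from sklearn.decomposition import PCA
        from sklearn.cluster import KMeans

        spectra = np.random.rand(500, 52)            # placeholder transit spectra
        z = StandardScaler().fit_transform(spectra)
        pcs = PCA(n_components=3).fit_transform(z)   # 3-D representation, as advocated
        labels = KMeans(n_clusters=4, n_init=10).fit_predict(pcs)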
    A Sneak Attack on Segmentation of Medical Images Using Deep Neural Network Classifiers. (arXiv:2201.02771v1 [eess.IV])
    (2 min) Instead of using current deep-learning segmentation models (like the UNet and its variants), we approach the segmentation problem using trained Convolutional Neural Network (CNN) classifiers, which automatically extract important features from classified targets for image classification. These extracted features can be visualized as heatmaps using Gradient-weighted Class Activation Mapping (Grad-CAM). This study tested whether the heatmaps could be used to segment the classified targets. We also propose an evaluation method for the heatmaps: re-train the CNN classifier using images filtered by the heatmaps and examine its performance. We used the mean Dice coefficient to evaluate segmentation results. Results from our experiments show that heatmaps can locate and segment partial tumor areas, but using only the heatmaps from CNN classifiers may not be an optimal approach for segmentation. In addition, we verified that the predictions of CNN classifiers mainly depend on tumor areas, and that dark regions in Grad-CAM's heatmaps also contribute to classification.
    Provable Clustering of a Union of Linear Manifolds Using Optimal Directions. (arXiv:2201.02745v1 [cs.LG])
    (2 min) This paper focuses on the Matrix Factorization based Clustering (MFC) method, one of the few closed-form algorithms for the subspace clustering problem. Despite being simple, closed-form, and computationally efficient, MFC can outperform other, more sophisticated subspace clustering methods in many challenging scenarios. We reveal the connection between MFC and the Innovation Pursuit (iPursuit) algorithm, which was shown to outperform other spectral-clustering-based methods by a notable margin, especially when the spans of the clusters are close. A novel theoretical study is presented which sheds light on the key performance factors of both algorithms (MFC/iPursuit), and it is shown that both can be robust to notable intersections between the spans of the clusters. Importantly, in contrast to the theoretical guarantees of other algorithms, which emphasized the distance between the subspaces as the key performance factor, and without making the innovation assumption, it is shown that the performance of MFC/iPursuit mainly depends on the distance between the innovative components of the clusters.
    DeHIN: A Decentralized Framework for Embedding Large-scale Heterogeneous Information Networks. (arXiv:2201.02757v1 [cs.LG])
    (2 min) Modeling heterogeneity by extraction and exploitation of high-order information from heterogeneous information networks (HINs) has been attracting immense research attention in recent times. Such heterogeneous network embedding (HNE) methods effectively harness the heterogeneity of small-scale HINs. However, in the real world, the size of HINs grows exponentially with the continuous introduction of new nodes and different types of links, making them billion-scale networks. Learning node embeddings on such HINs creates a performance bottleneck for existing HNE methods, which are commonly centralized, i.e., the complete data and the model both reside on a single machine. To address large-scale HNE tasks with strong efficiency and effectiveness guarantees, we present \textit{Decentralized Embedding Framework for Heterogeneous Information Network} (DeHIN) in this paper. In DeHIN, we generate a distributed parallel pipeline that utilizes hypergraphs in order to infuse parallelization into the HNE task. DeHIN presents a context-preserving partition mechanism that innovatively formulates a large HIN as a hypergraph, whose hyperedges connect semantically similar nodes. Our framework then adopts a decentralized strategy to efficiently partition HINs using a tree-like pipeline. Each resulting subnetwork is then assigned to a distributed worker, which employs the deep information maximization theorem to locally learn node embeddings from the partition it receives. We further devise a novel embedding alignment scheme to precisely project independently learned node embeddings from all subnetworks onto a common vector space, thus allowing for downstream tasks like link prediction and node classification.
    Neighbor2vec: an efficient and effective method for Graph Embedding. (arXiv:2201.02626v1 [cs.SI])
    (2 min) Graph embedding techniques have led to significant progress in recent years. However, present techniques are not effective enough at capturing the patterns of networks. This paper proposes neighbor2vec, an algorithm that uses a neighbor-based sampling strategy to learn neighborhood representations of nodes, and a framework that gathers structural information by feature propagation between a node and its neighbors. We claim that neighbor2vec is a simple and effective approach to enhancing both the scalability and the quality of graph embedding, and that it breaks the limits of the existing state-of-the-art unsupervised techniques. We conduct experiments on several node classification and link prediction tasks for networks such as ogbn-arxiv, ogbn-products, ogbn-proteins, ogbl-ppa, ogbl-collab and ogbl-citation2. The results show that neighbor2vec's representations provide average accuracy scores up to 6.8 percent higher than competing methods in node classification tasks and 3.0 percent higher in link prediction tasks. Neighbor2vec's representations outperform all baseline methods and two classical GNN models in all six experiments.
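    A bare-bones sketch of neighbor feature propagation in this spirit (the averaging rule is our assumption; the paper's exact aggregation and sampling may differ):

        import numpy as np

        def propagate(X, adj_list, rounds=2):
            # Repeatedly average each node's features with its neighbors'.
            H = X.copy()
            for _ in range(rounds):
                H = np.stack([
                    (H[i] + H[adj_list[i]].mean(axis=0)) / 2 if adj_list[i] else H[i]
                    for i in range(len(H))
                ])
            return H

        X = np.random.rand(4, 8)                 # toy node features
        adj = [[1, 2], [0], [0, 3], [2]]         # toy adjacency lists
        H = propagate(X, adj)                    # smoothed neighborhood representations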
    Attention Option-Critic. (arXiv:2201.02628v1 [cs.LG])
    (2 min) Temporal abstraction in reinforcement learning is the ability of an agent to learn and use high-level behaviors, called options. The option-critic architecture provides a gradient-based end-to-end learning method to construct options. We propose an attention-based extension to this framework, which enables the agent to learn to focus different options on different aspects of the observation space. We show that this leads to behaviorally diverse options which are also capable of state abstraction, and prevents the degeneracy problems of option domination and frequent option switching that occur in option-critic, while achieving a similar sample complexity. We also demonstrate the more efficient, interpretable, and reusable nature of the learned options in comparison with option-critic, through different transfer learning tasks. Experimental results in a relatively simple four-rooms environment and the more complex ALE (Arcade Learning Environment) showcase the efficacy of our approach.
    Machine Learning-Based Disease Diagnosis: A Bibliometric Analysis. (arXiv:2201.02755v1 [cs.LG])
    (2 min) Machine Learning (ML) has garnered considerable attention from researchers and practitioners as a new and adaptable tool for disease diagnosis. With the advancement of ML and the proliferation of papers and research in this field, a complete examination of Machine Learning-Based Disease Diagnosis (MLBDD) is required. From a bibliometrics standpoint, this article comprehensively studies MLBDD papers from 2012 to 2021. Using particular keywords, 1710 papers with associated information were extracted from the Scopus and Web of Science (WOS) databases and integrated into an Excel datasheet for further analysis. First, we examine the publication structure based on yearly publications and the most productive countries/regions, institutions, and authors. Second, the co-citation networks of countries/regions, institutions, authors, and articles are visualized using the R-studio software and further examined in terms of citation structure and the most influential members. This article gives an overview of MLBDD for researchers interested in the subject and conducts a thorough and complete study of MLBDD for those interested in conducting further research in this field.
    A Fair and Efficient Hybrid Federated Learning Framework based on XGBoost for Distributed Power Prediction. (arXiv:2201.02783v1 [cs.LG])
    (2 min) In a modern power system, real-time data on power generation/consumption and its relevant features are stored in various distributed parties, including household meters, transformer stations and external organizations. To fully exploit the underlying patterns of these distributed data for accurate power prediction, federated learning is needed as a collaborative but privacy-preserving training scheme. However, current federated learning frameworks are polarized towards addressing either the horizontal or vertical separation of data, and tend to overlook the case where both are present. Furthermore, in mainstream horizontal federated learning frameworks, only artificial neural networks are employed to learn the data patterns, which are considered less accurate and interpretable compared to tree-based models on tabular datasets. To this end, we propose a hybrid federated learning framework based on XGBoost, for distributed power prediction from real-time external features. In addition to introducing boosted trees to improve accuracy and interpretability, we combine horizontal and vertical federated learning, to address the scenario where features are scattered in local heterogeneous parties and samples are scattered in various local districts. Moreover, we design a dynamic task allocation scheme such that each party gets a fair share of information, and the computing power of each party can be fully leveraged to boost training efficiency. A follow-up case study is presented to justify the necessity of adopting the proposed framework. The advantages of the proposed framework in fairness, efficiency and accuracy performance are also confirmed.
    Agricultural Plant Cataloging and Establishment of a Data Framework from UAV-based Crop Images by Computer Vision. (arXiv:2201.02885v1 [cs.CV])
    (2 min) UAV-based image retrieval in modern agriculture enables the gathering of large amounts of spatially referenced crop image data. In large-scale experiments, however, UAV images contain a multitudinous number of crops in a complex canopy architecture. Especially for the observation of temporal effects, this tremendously complicates the recognition of individual plants over several images and the extraction of relevant information. In this work, we present a hands-on workflow for the automated temporal and spatial identification and individualization of crop images from UAVs, abbreviated as "cataloging", based on comprehensible computer vision methods. We evaluate the workflow on two real-world datasets. One dataset was recorded to observe Cercospora leaf spot, a fungal disease, in sugar beet over an entire growing cycle. The other deals with harvest prediction of cauliflower plants. The plant catalog is utilized for the extraction of single plant images seen over multiple time points. This yields a large-scale spatio-temporal image dataset that can in turn be used to train further machine learning models incorporating various data layers. The presented approach significantly improves the analysis and interpretation of UAV data in agriculture. In validation against reference data, our method shows an accuracy similar to that of more complex deep learning-based recognition techniques. Our workflow is able to automate plant cataloging and training image extraction, especially for large datasets.
    A fall alert system with prior-fall activity identification. (arXiv:2201.02803v1 [cs.LG])
    (2 min) Falling, especially in the elderly, is a critical issue to care for and surveil. There have been many studies focusing on fall detection; however, from our survey, there is still no research identifying prior-fall activities, which we believe have a strong correlation with the intensity of the fall. The purpose of this research is to develop a fall alert system that also identifies prior-fall activities. First, we wanted to find a suitable location to attach a sensor to the body. We created multiple-spot on-body devices to collect various activity data and used that dataset to train 5 different classification models. Comparing detection accuracies, we selected the XGBoost classification model for detecting prior-fall activities and the chest location for use in fall detection. We then tested 3 existing fall detection threshold algorithms to detect falls and falls to the knees first, and selected the 3-phase threshold algorithm of Chaitep and Chawachat [3] for our system. From the experiment, we found that the fall detection accuracy is 88.91%, the detection accuracy for falls to the knees first is 91.25%, and the average detection accuracy of prior-fall activities is 86.25%. Although we use an activity dataset of young to middle-aged adults (18-49 years), we are confident that this system can be developed to monitor activities before a fall, especially in the elderly, so that caretakers can better manage the situation.
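    For illustration only, here is a toy three-phase threshold check on accelerometer magnitude (free fall, then impact, then stillness); the thresholds and window lengths are placeholders, not the values of Chaitep and Chawachat [3]:

        import numpy as np

        def detect_fall(acc, free_fall=0.6, impact=2.5, still=0.15, fs=50):
            # acc: (T, 3) accelerometer samples in m/s^2 at fs Hz.
            mag = np.linalg.norm(acc, axis=1) / 9.81      # magnitude in g
            for i in range(len(mag) - fs):
                if mag[i] < free_fall and mag[i:i + fs // 2].max() > impact:
                    after = mag[i + fs // 2:i + fs]
                    if np.abs(after - 1.0).mean() < still:  # lying still near 1 g
                        return i                            # index of fall onset
            return None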
  • stat.ML updates on arXiv.org

    When is Offline Two-Player Zero-Sum Markov Game Solvable?. (arXiv:2201.03522v1 [cs.LG])
    (2 min) We study what dataset assumptions permit solving offline two-player zero-sum Markov games. In stark contrast to the offline single-agent Markov decision process, we show that the single-strategy concentration assumption is insufficient for learning the Nash equilibrium (NE) strategy in offline two-player zero-sum Markov games. On the other hand, we propose a new assumption named unilateral concentration and design a pessimism-type algorithm that is provably efficient under this assumption. In addition, we show that the unilateral concentration assumption is necessary for learning an NE strategy. Furthermore, our algorithm can achieve minimax sample complexity without any modification for two widely studied settings: datasets with the uniform concentration assumption, and turn-based Markov games. Our work serves as an important initial step towards understanding offline multi-agent reinforcement learning.
    Understanding Layer-wise Contributions in Deep Neural Networks through Spectral Analysis. (arXiv:2111.03972v3 [cs.LG] UPDATED)
    (2 min) Spectral analysis is a powerful tool, decomposing any function into simpler parts. In machine learning, Mercer's theorem generalizes this idea, providing for any kernel and input distribution a natural basis of functions of increasing frequency. More recently, several works have extended this analysis to deep neural networks through the framework of Neural Tangent Kernel. In this work, we analyze the layer-wise spectral bias of Deep Neural Networks and relate it to the contributions of different layers in the reduction of generalization error for a given target function. We utilize the properties of Hermite polynomials and Spherical Harmonics to prove that initial layers exhibit a larger bias towards high-frequency functions defined on the unit sphere. We further provide empirical results validating our theory in high dimensional datasets for Deep Neural Networks.
    Masked Gradient-Based Causal Structure Learning. (arXiv:1910.08527v3 [cs.LG] UPDATED)
    (2 min) This paper studies the problem of learning causal structures from observational data. We reformulate the Structural Equation Model (SEM) with additive noises in a form parameterized by binary graph adjacency matrix and show that, if the original SEM is identifiable, then the binary adjacency matrix can be identified up to super-graphs of the true causal graph under mild conditions. We then utilize the reformulated SEM to develop a causal structure learning method that can be efficiently trained using gradient-based optimization, by leveraging a smooth characterization on acyclicity and the Gumbel-Softmax approach to approximate the binary adjacency matrix. It is found that the obtained entries are typically near zero or one and can be easily thresholded to identify the edges. We conduct experiments on synthetic and real datasets to validate the effectiveness of the proposed method, and show that it readily includes different smooth model functions and achieves a much improved performance on most datasets considered.
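    A sketch of the two ingredients named here, assuming the smooth acyclicity characterization is the NOTEARS-style trace-of-matrix-exponential penalty (the abstract does not name it explicitly):

        import torch

        d = 5
        logits = torch.zeros(d, d, requires_grad=True)    # edge logits

        def sample_adj(logits, tau=0.5):
            # Gumbel-Softmax relaxation of a binary adjacency matrix.
            g = -torch.log(-torch.log(torch.rand(2, d, d)))
            probs = torch.softmax((torch.stack([logits, -logits]) + g) / tau, dim=0)
            return probs[0] * (1 - torch.eye(d))          # zero the diagonal

        def acyclicity(A):
            # Smooth DAG constraint: h(A) = tr(exp(A * A)) - d == 0 iff acyclic.
            return torch.matrix_exp(A * A).diagonal().sum() - d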
    On the Double Descent of Random Features Models Trained with SGD. (arXiv:2110.06910v4 [stat.ML] UPDATED)
    (2 min) This paper studies generalization properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD). In this regime, we derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting, and observe the double descent phenomenon both theoretically and empirically. Our analysis shows how to cope with multiple randomness sources of initialization, label noise, and data sampling (as well as stochastic gradients) with no closed-form solution, and also goes beyond the commonly-used Gaussian/spherical data assumption. Our theoretical results demonstrate that, with SGD training, RF regression still generalizes well for interpolation learning, and is able to characterize the double descent behavior by the unimodality of variance and monotonic decrease of bias. Besides, we also prove that the constant step-size SGD setting incurs no loss in convergence rate when compared to the exact minimal-norm interpolator, as a theoretical justification of using SGD in practice.
    Optimal Experimental Design for Staggered Rollouts. (arXiv:1911.03764v3 [econ.EM] UPDATED)
    (2 min) In this paper, we study the problem of designing experiments that are conducted on a set of units such as users or groups of users in an online marketplace, for multiple time periods such as weeks or months. These experiments are particularly useful to study the treatments that have causal effects on both current and future outcomes (instantaneous and lagged effects). The design problem involves selecting a treatment time for each unit, before or during the experiment, in order to most precisely estimate the instantaneous and lagged effects, post experimentation. This optimization of the treatment decisions can directly minimize the opportunity cost of the experiment by reducing its sample size requirement. The optimization is an NP-hard integer program for which we provide a near-optimal solution, when the design decisions are performed all at the beginning (fixed-sample-size designs). Next, we study sequential experiments that allow adaptive decisions during the experiments, and also potentially early stop the experiments, further reducing their cost. However, the sequential nature of these experiments complicates both the design phase and the estimation phase. We propose a new algorithm, PGAE, that addresses these challenges by adaptively making treatment decisions, estimating the treatment effects, and drawing valid post-experimentation inference. PGAE combines ideas from Bayesian statistics, dynamic programming, and sample splitting. Using synthetic experiments on real data sets from multiple domains, we demonstrate that our proposed solutions for fixed-sample-size and sequential experiments reduce the opportunity cost of the experiments by over 50% and 70%, respectively, compared to benchmarks.
    Comparing Sequential Forecasters. (arXiv:2110.00115v3 [stat.ME] UPDATED)
    (2 min) Consider two or more forecasters, each making a sequence of predictions for different events over time. We ask a relatively basic question: how might we compare these forecasters, either online or post-hoc, while avoiding unverifiable assumptions on how the forecasts or outcomes were generated? This work presents a novel and rigorous answer to this question. We design a sequential inference procedure for estimating the time-varying difference in forecast quality as measured by any scoring rule. The resulting confidence intervals are nonasymptotically valid and can be continuously monitored to yield statistically valid comparisons at arbitrary data-dependent stopping times ("anytime-valid"); this is enabled by adapting variance-adaptive supermartingales, confidence sequences, and e-processes to our setting. Motivated by Shafer and Vovk's game-theoretic probability, our coverage guarantees are also distribution-free, in the sense that they make no distributional assumptions on the forecasts or outcomes. In contrast to a recent work by Henzi and Ziegel, our tools can sequentially test a weak null hypothesis about whether one forecaster outperforms another on average over time. We demonstrate their effectiveness by comparing probability forecasts on Major League Baseball (MLB) games and statistical postprocessing methods for ensemble weather forecasts.
    Bayesian Consistency with the Supremum Metric. (arXiv:2201.03447v1 [math.ST])
    (2 min) We present simple conditions for Bayesian consistency in the supremum metric. The key to the technique is a triangle inequality which allows us to explicitly use weak convergence, a consequence of the standard Kullback--Leibler support condition for the prior. A further condition is to ensure that smoothed versions of densities are not too far from the original density, thus dealing with densities which could track the data too closely. A key result of the paper is that we demonstrate supremum consistency using weaker conditions compared to those currently used to secure $\mathbb{L}_1$ consistency.
    Traversing the Local Polytopes of ReLU Neural Networks: A Unified Approach for Network Verification. (arXiv:2111.08922v2 [cs.LG] UPDATED)
    (2 min) Although neural networks (NNs) with ReLU activation functions have found success in a wide range of applications, their adoption in risk-sensitive settings has been limited by concerns about robustness and interpretability. Previous works on examining robustness and improving interpretability partially exploited the piecewise-linear form of ReLU NNs. In this paper, we explore the unique topological structure that ReLU NNs create in the input space, identifying the adjacency among the partitioned local polytopes and developing a traversing algorithm based on this adjacency. Our polytope traversing algorithm can be adapted to verify a wide range of network properties related to robustness and interpretability, providing a unified approach to examining network behavior. As the traversing algorithm explicitly visits all local polytopes, it returns a clear and full picture of the network behavior within the traversed region. The time and space complexity of the traversing algorithm is determined by the number of a ReLU NN's partitioning hyperplanes passing through the traversing region.
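    To make the "local polytope" notion concrete, here is a minimal sketch (not the paper's traversal algorithm) showing that a ReLU network's activation pattern indexes the polytope containing an input, within which the network is a single affine map. The toy one-hidden-layer network and its sizes are illustrative.
    ```python
    # Sketch: the ReLU activation pattern at a point identifies the local
    # polytope containing it; each hidden unit contributes one halfspace.
    import numpy as np

    rng = np.random.default_rng(1)
    W1, b1 = rng.standard_normal((8, 2)), rng.standard_normal(8)
    W2, b2 = rng.standard_normal((1, 8)), rng.standard_normal(1)

    def activation_pattern(x):
        """Sign pattern of the first-layer preactivations at x."""
        return tuple((W1 @ x + b1 > 0).astype(int))

    x = np.array([0.3, -0.7])
    pattern = activation_pattern(x)
    print("polytope id:", pattern)

    # Inside this polytope the network reduces to one affine map:
    D = np.diag(pattern)                    # mask of active units
    A = W2 @ D @ W1                         # local linear term
    c = W2 @ D @ b1 + b2                    # local offset
    assert np.allclose(A @ x + c, W2 @ np.maximum(W1 @ x + b1, 0) + b2)
    ```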
    Cluster Regularization via a Hierarchical Feature Regression. (arXiv:2107.04831v2 [stat.ML] UPDATED)
    (2 min) This paper proposes a novel graph-based regularized regression estimator, the hierarchical feature regression (HFR), which mobilizes insights from the domains of machine learning and graph theory to estimate robust parameters for a linear regression. The estimator constructs a supervised feature graph that decomposes parameters along its edges, adjusting first for common variation and successively incorporating idiosyncratic patterns into the fitting process. The graph structure has the effect of shrinking parameters towards group targets, where the extent of shrinkage is governed by a hyperparameter, and group compositions as well as shrinkage targets are determined endogenously. The method offers rich resources for the visual exploration of the latent effect structure in the data, and demonstrates good predictive accuracy and versatility when compared to a panel of commonly used regularization techniques across a range of empirical and simulated regression tasks.
    Stability Based Generalization Bounds for Exponential Family Langevin Dynamics. (arXiv:2201.03064v1 [cs.LG])
    (2 min) We study generalization bounds for noisy stochastic mini-batch iterative algorithms based on the notion of stability. Recent years have seen key advances in data-dependent generalization bounds for noisy iterative learning algorithms such as stochastic gradient Langevin dynamics (SGLD) based on stability (Mou et al., 2018; Li et al., 2020) and information theoretic approaches (Xu and Raginsky, 2017; Negrea et al., 2019; Steinke and Zakynthinou, 2020; Haghifam et al., 2020). In this paper, we unify and substantially generalize stability based generalization bounds and make three technical advances. First, we bound the generalization error of general noisy stochastic iterative algorithms (not necessarily gradient descent) in terms of expected (not uniform) stability. The expected stability can in turn be bounded by a Le Cam-style divergence. Such bounds have an O(1/n) sample dependence, unlike many existing bounds with O(1/\sqrt{n}) dependence. Second, we introduce Exponential Family Langevin Dynamics (EFLD), which is a substantial generalization of SGLD and which allows exponential family noise to be used with stochastic gradient descent (SGD). We establish data-dependent expected stability based generalization bounds for general EFLD algorithms. Third, we consider an important special case of EFLD: noisy sign-SGD, which extends sign-SGD using Bernoulli noise over {-1,+1}. Generalization bounds for noisy sign-SGD are implied by those of EFLD, and we also establish optimization guarantees for the algorithm. Further, we present empirical results on benchmark datasets to illustrate that our bounds are non-vacuous and quantitatively much sharper than existing bounds.
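    As one concrete, heavily hedged reading of noisy sign-SGD: each coordinate takes a random ±1 step whose mean follows the negative gradient. The Bernoulli parameterization below (a sigmoid link) is an assumption made for illustration; the paper's exact noise model may differ.
    ```python
    # Hedged sketch of a noisy sign-SGD step: random signs in {-1, +1}
    # whose expectation points along -grad. Illustrative parameterization.
    import numpy as np

    rng = np.random.default_rng(2)

    def noisy_sign_sgd_step(w, grad, lr=1e-2, temp=1.0):
        # P(sign = -1) grows with the gradient coordinate (sigmoid link),
        # so E[step] = -lr * tanh(grad / (2 * temp)): a descent direction.
        p_neg = 1.0 / (1.0 + np.exp(-grad / temp))
        signs = np.where(rng.random(w.shape) < p_neg, -1.0, 1.0)
        return w + lr * signs

    w = np.zeros(5)
    for _ in range(2000):
        grad = w - np.arange(5.0)          # gradient of 0.5 * ||w - target||^2
        w = noisy_sign_sgd_step(w, grad)
    print(np.round(w, 1))                  # should approach [0, 1, 2, 3, 4]
    ```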
    Permuted and Unlinked Monotone Regression in $\mathbb{R}^d$: an approach based on mixture modeling and optimal transport. (arXiv:2201.03528v1 [math.ST])
    (2 min) Suppose that we have a regression problem with response variable Y in $\mathbb{R}^d$ and predictor X in $\mathbb{R}^d$, for $d \geq 1$. In permuted or unlinked regression we have access to separate unordered data on X and Y, as opposed to data on (X,Y)-pairs in usual regression. So far in the literature the case $d=1$ has received attention, see e.g., the recent papers by Rigollet and Weed [Information & Inference, 8, 619--717] and Balabdaoui et al. [J. Mach. Learn. Res., 22(172), 1--60]. In this paper, we consider the general multivariate setting with $d \geq 1$. We show that the notion of cyclical monotonicity of the regression function is sufficient for identification and estimation in the permuted/unlinked regression model. We study permutation recovery in the permuted regression setting and develop a computationally efficient and easy-to-use algorithm for denoising based on the Kiefer-Wolfowitz [Ann. Math. Statist., 27, 887--906] nonparametric maximum likelihood estimator and techniques from the theory of optimal transport. We provide explicit upper bounds on the associated mean squared denoising error for Gaussian noise. As in previous work on the case $d = 1$, the permuted/unlinked setting involves slow (logarithmic) rates of convergence rooted in the underlying deconvolution problem. Numerical studies corroborate our theoretical analysis and show that the proposed approach performs at least on par with the methods in the aforementioned prior work in the case $d = 1$ while achieving substantial reductions in terms of computational complexity.
    Uncovering the Source of Machine Bias. (arXiv:2201.03092v1 [cs.LG])
    (2 min) We develop a structural econometric model to capture the decision dynamics of human evaluators on an online micro-lending platform, and estimate the model parameters using a real-world dataset. We find that two types of gender bias, preference-based bias and belief-based bias, are present in human evaluators' decisions, and both favor female applicants. Through counterfactual simulations, we quantify the effect of gender bias on loan-granting outcomes and the welfare of the company and the borrowers. Our results imply that both biases reduce the company's profits: removing either the preference-based bias or the belief-based bias increases profits. Both increases result from raising the approval probability for borrowers, especially male borrowers, who eventually pay back their loans. For borrowers, eliminating either bias narrows the gender gap in the true positive rates of the credit risk evaluation. We also train machine learning algorithms on both the real-world data and the data from the counterfactual simulations. We compare the decisions made by those algorithms to see how evaluators' biases are inherited by the algorithms and reflected in machine-based decisions. We find that machine learning algorithms can mitigate both the preference-based bias and the belief-based bias.
    Introduction to Multi-Armed Bandits. (arXiv:1904.07272v7 [cs.LG] UPDATED)
    (2 min) Multi-armed bandits are a simple but very powerful framework for algorithms that make decisions over time under uncertainty. An enormous body of work has accumulated over the years, covered in several books and surveys. This book provides a more introductory, textbook-like treatment of the subject. Each chapter tackles a particular line of work, providing a self-contained, teachable technical introduction and a brief review of the further developments; many of the chapters conclude with exercises. The book is structured as follows. The first four chapters are on IID rewards, from the basic model to impossibility results to Bayesian priors to Lipschitz rewards. The next three chapters cover adversarial rewards, from the full-feedback version to adversarial bandits to extensions with linear rewards and combinatorially structured actions. Chapter 8 is on contextual bandits, a middle ground between IID and adversarial bandits in which the change in reward distributions is completely explained by observable contexts. The last three chapters cover connections to economics, from learning in repeated games to bandits with supply/budget constraints to exploration in the presence of incentives. The appendix provides sufficient background on concentration and KL-divergence. The chapters on "bandits with similarity information", "bandits with knapsacks" and "bandits and agents" can also be consumed as standalone surveys on the respective topics.
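    As a taste of the IID-rewards material in the opening chapters, here is a minimal sketch of UCB1, one of the canonical algorithms in this setting; the Bernoulli arms and horizon below are illustrative.
    ```python
    # Minimal UCB1 sketch for IID Bernoulli rewards.
    import math, random

    def ucb1(arm_means, horizon=10_000):
        k = len(arm_means)
        counts, sums = [0] * k, [0.0] * k
        for t in range(1, horizon + 1):
            if t <= k:
                a = t - 1                      # play each arm once to initialize
            else:
                a = max(range(k), key=lambda i: sums[i] / counts[i]
                        + math.sqrt(2 * math.log(t) / counts[i]))   # UCB index
            reward = 1.0 if random.random() < arm_means[a] else 0.0
            counts[a] += 1
            sums[a] += reward
        return counts

    print(ucb1([0.3, 0.5, 0.7]))   # most pulls should go to the 0.7 arm
    ```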
    Surrogate-assisted performance prediction for data-driven knowledge discovery algorithms: application to evolutionary modeling of clinical pathways. (arXiv:2004.01123v2 [cs.LG] UPDATED)
    (2 min) The paper proposes and investigates an approach for surrogate-assisted performance prediction of data-driven knowledge discovery algorithms. The approach is based on the identification of surrogate models for prediction of the target algorithm's quality and performance. The proposed approach was implemented and investigated as applied to an evolutionary algorithm for discovering clusters of interpretable clinical pathways in electronic health records of patients with acute coronary syndrome. Several clustering metrics and execution time were used as the target quality and performance metrics respectively. An analytical software prototype based on the proposed approach for the prediction of algorithm characteristics and feature analysis was developed to provide a more interpretable prediction of the target algorithm's performance and quality that can be further used for parameter tuning.
    Loss-calibrated expectation propagation for approximate Bayesian decision-making. (arXiv:2201.03128v1 [stat.ML])
    (2 min) Approximate Bayesian inference methods provide a powerful suite of tools for finding approximations to intractable posterior distributions. However, machine learning applications typically involve selecting actions, which -- in a Bayesian setting -- depend on the posterior distribution only via its contribution to expected utility. A growing body of work on loss-calibrated approximate inference methods has therefore sought to develop posterior approximations sensitive to the influence of the utility function. Here we introduce loss-calibrated expectation propagation (Loss-EP), a loss-calibrated variant of expectation propagation. This method resembles standard EP with an additional factor that "tilts" the posterior towards higher-utility decisions. We show applications to Gaussian process classification under binary utility functions with asymmetric penalties on false negative and false positive errors, and demonstrate how this asymmetry can have dramatic consequences for what information is "useful" to capture in an approximation.
    Individual Privacy Accounting via a Renyi Filter. (arXiv:2008.11193v4 [cs.CR] UPDATED)
    (2 min) We consider a sequential setting in which a single dataset of individuals is used to perform adaptively-chosen analyses, while ensuring that the differential privacy loss of each participant does not exceed a pre-specified privacy budget. The standard approach to this problem relies on bounding a worst-case estimate of the privacy loss over all individuals and all possible values of their data, for every single analysis. Yet, in many scenarios this approach is overly conservative, especially for "typical" data points which incur little privacy loss by participation in most of the analyses. In this work, we give a method for tighter privacy loss accounting based on the value of a personalized privacy loss estimate for each individual in each analysis. To implement the accounting method we design a filter for R\'enyi differential privacy. A filter is a tool that ensures that the privacy parameter of a composed sequence of algorithms with adaptively-chosen privacy parameters does not exceed a pre-specified budget. Our filter is simpler and tighter than the known filter for $(\epsilon,\delta)$-differential privacy by Rogers et al. We apply our results to the analysis of noisy gradient descent and show that personalized accounting can be practical, easy to implement, and can only make the privacy-utility tradeoff tighter.
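    The control flow of a privacy filter can be illustrated in a few lines. The sketch below is a deliberate simplification: it gates a sequence of adaptively chosen analyses by a scalar budget, whereas the paper's filter operates on R\'enyi divergences with personalized, data-dependent loss estimates.
    ```python
    # Simplified picture of a privacy filter: run adaptively chosen
    # analyses and refuse any that would exceed the accumulated budget.
    # The real Renyi filter tracks per-individual, data-dependent losses.
    def run_with_filter(analyses, budget):
        spent = 0.0
        results = []
        for eps, analysis in analyses:       # eps may be chosen adaptively
            if spent + eps > budget:         # filter: never exceed the budget
                break
            results.append(analysis())
            spent += eps
        return results, spent

    analyses = [(0.3, lambda: "q1"), (0.5, lambda: "q2"), (0.4, lambda: "q3")]
    print(run_with_filter(analyses, budget=1.0))   # runs q1, q2; q3 is refused
    ```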
    A Cross Validation framework for Signal Denoising with Applications to Trend Filtering, Dyadic CART and Beyond. (arXiv:2201.02654v1 [math.ST])
    (2 min) This paper formulates a general cross-validation framework for signal denoising. The general framework is then applied to nonparametric regression methods such as Trend Filtering and Dyadic CART. The resulting cross-validated versions are then shown to attain nearly the same rates of convergence as are known for the optimally tuned analogues. No previous theoretical analyses of cross-validated versions of Trend Filtering or Dyadic CART existed. To illustrate the generality of the framework, we also propose and study cross-validated versions of two fundamental estimators: the lasso for high-dimensional linear regression and singular value thresholding for matrix estimation. Our general framework is inspired by the ideas in Chatterjee and Jafarov (2015) and is potentially applicable to a wide range of estimation methods that use tuning parameters.
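    For orientation, the familiar K-fold cross-validated lasso (which the paper's framework parallels via its own splitting scheme tailored to denoising) can be run with scikit-learn; the dimensions below are illustrative.
    ```python
    # Standard K-fold cross-validated lasso in a high-dimensional setting.
    import numpy as np
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(3)
    n, p, s = 200, 500, 5                       # high-dimensional: p > n
    X = rng.standard_normal((n, p))
    beta = np.zeros(p); beta[:s] = 2.0          # sparse ground truth
    y = X @ beta + rng.standard_normal(n)

    model = LassoCV(cv=5).fit(X, y)             # tuning parameter via 5-fold CV
    print("chosen alpha:", model.alpha_)
    print("recovered support:", np.flatnonzero(np.abs(model.coef_) > 0.5))
    ```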
    Smooth Nested Simulation: Bridging Cubic and Square Root Convergence Rates in High Dimensions. (arXiv:2201.02958v1 [stat.ME])
    (2 min) Nested simulation concerns estimating functionals of a conditional expectation via simulation. In this paper, we propose a new method based on kernel ridge regression to exploit the smoothness of the conditional expectation as a function of the multidimensional conditioning variable. Asymptotic analysis shows that the proposed method can effectively alleviate the curse of dimensionality on the convergence rate as the simulation budget increases, provided that the conditional expectation is sufficiently smooth. The smoothness bridges the gap between the cubic root convergence rate (that is, the optimal rate for the standard nested simulation) and the square root convergence rate (that is, the canonical rate for the standard Monte Carlo simulation). We demonstrate the performance of the proposed method via numerical examples from portfolio risk management and input uncertainty quantification.
    Optimizing the Communication-Accuracy Trade-off in Federated Learning with Rate-Distortion Theory. (arXiv:2201.02664v1 [cs.LG])
    (2 min) A significant bottleneck in federated learning is the network communication cost of sending model updates from client devices to the central server. We propose a method to reduce this cost. Our method encodes quantized updates with an appropriate universal code, taking into account their empirical distribution. Because quantization introduces error, we select quantization levels by optimizing for the desired trade-off in average total bitrate and gradient distortion. We demonstrate empirically that in spite of the non-i.i.d. nature of federated learning, the rate-distortion frontier is consistent across datasets, optimizers, clients and training rounds, and within each setting, distortion reliably predicts model performance. This allows for a remarkably simple compression scheme that is near-optimal in many use cases, and outperforms Top-K, DRIVE, 3LC and QSGD on the Stack Overflow next-word prediction benchmark.
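    A hedged sketch of the underlying rate-distortion trade-off: uniformly quantize an update vector at a few resolutions and record bitrate versus mean squared distortion. The scheme below is only illustrative; the paper additionally entropy-codes the quantized values with a universal code and optimizes the levels for the desired trade-off.
    ```python
    # Sketch: uniform quantization of a model update at several
    # resolutions, tracing out a bitrate/distortion curve.
    import numpy as np

    rng = np.random.default_rng(4)
    update = rng.standard_normal(10_000) * 0.01        # stand-in client update

    def quantize(v, n_levels):
        lo, hi = v.min(), v.max()
        step = (hi - lo) / (n_levels - 1)
        idx = np.round((v - lo) / step).astype(int)    # integer codewords
        return lo + idx * step, idx

    for n_levels in (4, 16, 64):
        vq, idx = quantize(update, n_levels)
        distortion = np.mean((update - vq) ** 2)
        naive_bits = np.log2(n_levels)                 # before entropy coding
        print(f"{n_levels:3d} levels: {naive_bits:.0f} bits/coord, MSE={distortion:.2e}")
    ```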
    Attention-based Random Forest and Contamination Model. (arXiv:2201.02880v1 [cs.LG])
    (2 min) A new approach called ABRF (the attention-based random forest) and its modifications for applying the attention mechanism to the random forest (RF) for regression and classification are proposed. The main idea behind the proposed ABRF models is to assign attention weights with trainable parameters to decision trees in a specific way. The weights depend on the distance between an instance, which falls into a corresponding leaf of a tree, and the instances that fall into the same leaf. This idea stems from the representation of Nadaraya-Watson kernel regression in the form of an RF. Three modifications of the general approach are proposed. The first is based on applying Huber's contamination model and on computing the attention weights by solving quadratic or linear optimization problems. The second and third modifications use gradient-based algorithms for computing the trainable parameters. Numerical experiments with various regression and classification datasets illustrate the proposed method.
    Optimal 1-Wasserstein Distance for WGANs. (arXiv:2201.02824v1 [stat.ML])
    (2 min) The mathematical forces at work behind Generative Adversarial Networks raise challenging theoretical issues. Motivated by the important question of characterizing the geometrical properties of the generated distributions, we provide a thorough analysis of Wasserstein GANs (WGANs) in both the finite sample and asymptotic regimes. We study the specific case where the latent space is univariate and derive results valid regardless of the dimension of the output space. We show in particular that for a fixed sample size, the optimal WGANs are closely linked with connected paths minimizing the sum of the squared Euclidean distances between the sample points. We also highlight the fact that WGANs are able to approach (for the 1-Wasserstein distance) the target distribution as the sample size tends to infinity, at a given convergence rate and provided the family of generative Lipschitz functions grows appropriately. We derive in passing new results on optimal transport theory in the semi-discrete setting.
    AIDA: An Active Inference-based Design Agent for Audio Processing Algorithms. (arXiv:2112.13366v2 [eess.AS] UPDATED)
    (2 min) In this paper we present AIDA, which is an active inference-based agent that iteratively designs a personalized audio processing algorithm through situated interactions with a human client. The target application of AIDA is to propose on-the-spot the most interesting alternative values for the tuning parameters of a hearing aid (HA) algorithm, whenever an HA client is not satisfied with their HA performance. AIDA interprets searching for the "most interesting alternative" as an issue of optimal (acoustic) context-aware Bayesian trial design. In computational terms, AIDA is realized as an active inference-based agent with an Expected Free Energy criterion for trial design. This type of architecture is inspired by neuro-economic models on efficient (Bayesian) trial design in brains and implies that AIDA comprises generative probabilistic models for acoustic signals and user responses. We propose a novel generative model for acoustic signals as a sum of time-varying auto-regressive filters and a user response model based on a Gaussian Process Classifier. The full AIDA agent has been implemented in a factor graph for the generative model and all tasks (parameter learning, acoustic context classification, trial design, etc.) are realized by variational message passing on the factor graph. All verification and validation experiments and demonstrations are freely accessible at our GitHub repository.
    Lazy Lagrangians with Predictions for Online Learning. (arXiv:2201.02890v1 [cs.LG])
    (2 min) We consider the general problem of online convex optimization with time-varying additive constraints in the presence of predictions for the next cost and constraint functions. A novel primal-dual algorithm is designed by combining a Follow-The-Regularized-Leader iteration with prediction-adaptive dynamic steps. The algorithm achieves $\mathcal O(T^{\frac{3-\beta}{4}})$ regret and $\mathcal O(T^{\frac{1+\beta}{2}})$ constraint violation bounds that are tunable via parameter $\beta\!\in\![1/2,1)$ and have constant factors that shrink with the predictions quality, achieving eventually $\mathcal O(1)$ regret for perfect predictions. Our work extends the FTRL framework for this constrained OCO setting and outperforms the respective state-of-the-art greedy-based solutions, without imposing conditions on the quality of predictions, the cost functions or the geometry of constraints, beyond convexity.
    Neural-PDE: A RNN based neural network for solving time dependent PDEs. (arXiv:2009.03892v3 [math.NA] UPDATED)
    (2 min) Partial differential equations (PDEs) play a crucial role in studying a vast number of problems in science and engineering. Numerically solving nonlinear and/or high-dimensional PDEs is often a challenging task. Inspired by the traditional finite difference and finite element methods and emerging advancements in machine learning, we propose a sequence deep learning framework called Neural-PDE, which automatically learns governing rules of any time-dependent PDE system from existing data using a bidirectional LSTM encoder and predicts the data for the next n time steps. One critical feature of our proposed framework is that the Neural-PDE is able to simultaneously learn and simulate the multiscale variables. We test the Neural-PDE on a range of examples, from one-dimensional PDEs to a high-dimensional and nonlinear complex fluids model. The results show that the Neural-PDE is capable of learning the initial conditions, boundary conditions and differential operators without knowledge of the specific form of the PDE system. In our experiments the Neural-PDE can efficiently extract the dynamics within 20 epochs of training and produces accurate predictions. Furthermore, unlike traditional machine learning approaches for learning PDEs, such as CNNs and MLPs, which require vast parameters for model precision, the Neural-PDE shares parameters across all time steps, considerably reducing the computational complexity and leading to a fast learning algorithm.
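    A generic PyTorch sketch of the sequence-learning idea (not the authors' exact Neural-PDE architecture, which uses a bidirectional LSTM encoder): treat discretized solution snapshots as a sequence over time and train a recurrent model to map k past snapshots to the next one. All sizes and the toy dynamics are illustrative.
    ```python
    # Sketch: learn one-step dynamics of a discretized PDE solution
    # u(x, t) with an LSTM over k past snapshots.
    import math
    import torch
    import torch.nn as nn

    class SnapshotLSTM(nn.Module):
        def __init__(self, n_grid, hidden=64):
            super().__init__()
            self.lstm = nn.LSTM(input_size=n_grid, hidden_size=hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_grid)   # decode back to the spatial grid
        def forward(self, u_past):                  # (batch, k, n_grid)
            out, _ = self.lstm(u_past)
            return self.head(out[:, -1])            # predicted next snapshot

    # Toy data: decaying travelling sine wave on a 1-D grid.
    x = torch.linspace(0, 2 * math.pi, 64)
    t = torch.arange(40).float()
    u = torch.sin(x[None, :] - 0.1 * t[:, None]) * torch.exp(-0.01 * t)[:, None]

    k = 5
    inputs = torch.stack([u[i:i + k] for i in range(len(t) - k)])   # (35, 5, 64)
    targets = u[k:]                                                 # (35, 64)

    model = SnapshotLSTM(n_grid=64)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(300):
        loss = nn.functional.mse_loss(model(inputs), targets)
        opt.zero_grad(); loss.backward(); opt.step()
    print("one-step MSE:", loss.item())
    ```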
    SCROLLS: Standardized CompaRison Over Long Language Sequences. (arXiv:2201.03533v1 [cs.CL])
    (2 min) NLP benchmarks have largely focused on short texts, such as sentences and paragraphs, even though long texts comprise a considerable amount of natural language in the wild. We introduce SCROLLS, a suite of tasks that require reasoning over long texts. We examine existing long-text datasets, and handpick ones where the text is naturally long, while prioritizing tasks that involve synthesizing information across the input. SCROLLS contains summarization, question answering, and natural language inference tasks, covering multiple domains, including literature, science, business, and entertainment. Initial baselines, including Longformer Encoder-Decoder, indicate that there is ample room for improvement on SCROLLS. We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
    Accelerated Gradient Methods for Sparse Statistical Learning with Nonconvex Penalties. (arXiv:2009.10629v3 [math.OC] UPDATED)
    (2 min) Nesterov's accelerated gradient (AG) is a popular technique to optimize objective functions comprising two components: a convex loss and a penalty function. While AG methods perform well for convex penalties, such as the LASSO, convergence issues may arise when they are applied to nonconvex penalties, such as SCAD. A recent proposal generalizes Nesterov's AG method to the nonconvex setting but has never been applied to sparse statistical learning problems. There are several hyperparameters to be set before running the proposed algorithm, yet there is no explicit rule for how they should be selected. In this article, we consider the application of this nonconvex AG algorithm to high-dimensional linear and logistic sparse learning problems, and propose a hyperparameter setting based on the complexity upper bound to accelerate convergence. We further establish the rate of convergence and present a simple and useful bound for the damping sequence. Simulation studies show that convergence is, on average, considerably faster than that of the conventional ISTA algorithm. Our experiments also show that the proposed method generally outperforms the current state-of-the-art method in terms of signal recovery.
    Robust classification with flexible discriminant analysis in heterogeneous data. (arXiv:2201.02967v1 [stat.ML])
    (2 min) Linear and Quadratic Discriminant Analysis are well-known classical methods but can heavily suffer from non-Gaussian distributions and/or contaminated datasets, mainly because the underlying Gaussian assumption is not robust. To fill this gap, this paper presents a new robust discriminant analysis where each data point is drawn from its own arbitrary Elliptically Symmetrical (ES) distribution with its own arbitrary scale parameter. Such a model allows for possibly very heterogeneous, independent but non-identically distributed samples. After deriving a new decision rule, it is shown that maximum-likelihood parameter estimation and classification are very simple, fast and robust compared to state-of-the-art methods.
    An application of the splitting-up method for the computation of a neural network representation for the solution for the filtering equations. (arXiv:2201.03283v1 [math.PR])
    (2 min) The filtering equations govern the evolution of the conditional distribution of a signal process given partial, and possibly noisy, observations arriving sequentially in time. Their numerical approximation plays a central role in many real-life applications, including numerical weather prediction, finance and engineering. One of the classical approaches to approximate the solution of the filtering equations is to use a PDE inspired method, called the splitting-up method, initiated by Gyöngy, Krylov, and Le Gland, among other contributors. This method, and other PDE based approaches, have particular applicability for solving low-dimensional problems. In this work we combine this method with a neural network representation. The new methodology is used to produce an approximation of the unnormalised conditional distribution of the signal process. We further develop a recursive normalisation procedure to recover the normalised conditional distribution of the signal process. The new scheme can be iterated over multiple time steps whilst keeping its asymptotic unbiasedness property intact. We test the neural network approximations with numerical approximation results for the Kalman and Beneš filters.
    Selecting the Best Optimizing System. (arXiv:2201.03065v1 [stat.ME])
    (2 min) We formulate selecting the best optimizing system (SBOS) problems and provide solutions for those problems. In an SBOS problem, a finite number of systems are contenders. Inside each system, a continuous decision variable affects the system's expected performance. An SBOS problem compares different systems based on their expected performances under their own optimally chosen decisions to select the best, without advance knowledge of either the expected performances of the systems or the optimizing decision inside each system. We design easy-to-implement algorithms that adaptively choose a system and a decision at which to evaluate the noisy system performance, sequentially eliminate inferior systems, and eventually recommend a system as the best after spending a user-specified budget. The proposed algorithms integrate the stochastic gradient descent method and the sequential elimination method to simultaneously exploit the structure inside each system and make comparisons across systems. For the proposed algorithms, we prove exponential rates of convergence to zero for the probability of false selection as the budget grows to infinity. We conduct three numerical examples that represent three practical cases of SBOS problems. Our proposed algorithms demonstrate consistently stronger performance in terms of the probability of false selection than benchmark algorithms under a range of problem settings and sampling budgets.
    Retiring Adult: New Datasets for Fair Machine Learning. (arXiv:2108.04884v3 [cs.LG] UPDATED)
    (2 min) Although the fairness community has recognized the importance of data, researchers in the area primarily rely on UCI Adult when it comes to tabular data. Derived from a 1994 US Census survey, this dataset has appeared in hundreds of research papers where it served as the basis for the development and comparison of many algorithmic fairness interventions. We reconstruct a superset of the UCI Adult data from available US Census sources and reveal idiosyncrasies of the UCI Adult dataset that limit its external validity. Our primary contribution is a suite of new datasets derived from US Census surveys that extend the existing data ecosystem for research on fair machine learning. We create prediction tasks relating to income, employment, health, transportation, and housing. The data span multiple years and all states of the United States, allowing researchers to study temporal shift and geographic variation. We highlight a broad initial sweep of new empirical insights relating to trade-offs between fairness criteria, performance of algorithmic interventions, and the role of distribution shift based on our new datasets. Our findings inform ongoing debates, challenge some existing narratives, and point to future research directions. Our datasets are available at https://github.com/zykls/folktables.
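    Usage follows the repository's README; the class and argument names below are taken from it but may change across versions.
    ```python
    # Sketch of loading one of the new prediction tasks with folktables.
    from folktables import ACSDataSource, ACSIncome

    data_source = ACSDataSource(survey_year="2018", horizon="1-Year", survey="person")
    acs_data = data_source.get_data(states=["CA"], download=True)   # raw ACS data
    features, label, group = ACSIncome.df_to_numpy(acs_data)        # income task

    print(features.shape, label.mean())   # analogous to UCI Adult, but current
    ```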
    Differentially Private $\ell_1$-norm Linear Regression with Heavy-tailed Data. (arXiv:2201.03204v1 [cs.LG])
    (2 min) We study the problem of Differentially Private Stochastic Convex Optimization (DP-SCO) with heavy-tailed data. Specifically, we focus on the $\ell_1$-norm linear regression in the $\epsilon$-DP model. While most of the previous work focuses on the case where the loss function is Lipschitz, here we only need to assume that the variates have bounded moments. Firstly, we study the case where the $\ell_2$ norm of the data has a bounded second-order moment. We propose an algorithm which is based on the exponential mechanism and show that it is possible to achieve an upper bound of $\tilde{O}(\sqrt{\frac{d}{n\epsilon}})$ (with high probability). Next, we relax the assumption to a bounded $\theta$-th order moment with some $\theta\in (1, 2)$ and show that it is possible to achieve an upper bound of $\tilde{O}(({\frac{d}{n\epsilon}})^\frac{\theta-1}{\theta})$. Our algorithms can also be extended to more relaxed cases where only each coordinate of the data has bounded moments, and we can get upper bounds of $\tilde{O}({\frac{d}{\sqrt{n\epsilon}}})$ and $\tilde{O}({\frac{d}{({n\epsilon})^\frac{\theta-1}{\theta}}})$ in the second and $\theta$-th moment cases respectively.
    Learning polytopes with fixed facet directions. (arXiv:2201.03419v1 [math.MG])
    (2 min) We consider the task of reconstructing polytopes with fixed facet directions from finitely many support function evaluations. We show that for fixed simplicial normal fan the least-squares estimate is given by a convex quadratic program. We study the geometry of the solution set and give a combinatorial characterization for the uniqueness of the reconstruction in this case. We provide an algorithm that, under mild assumptions, converges to the unknown input shape as the number of noisy support function evaluations increases. We also discuss limitations of our results if the restriction on the normal fan is removed.
    Effective Sample Size, Dimensionality, and Generalization in Covariate Shift Adaptation. (arXiv:2010.01184v5 [stat.ML] UPDATED)
    (2 min) In supervised learning, training and test datasets are often sampled from distinct distributions, so domain adaptation techniques are required. Covariate shift adaptation yields good generalization performance when domains differ only by the marginal distribution of features. It is usually implemented using importance weighting, which may fail, according to common wisdom, due to small effective sample sizes (ESS). Previous research argues this scenario is more common in high-dimensional settings. However, how effective sample size, dimensionality, and model performance/generalization are formally related in supervised learning, in the context of covariate shift adaptation, remains somewhat obscure in the literature. In this paper, we therefore focus on building a unified view connecting the ESS, data dimensionality, and generalization in the context of covariate shift adaptation. We also demonstrate how dimensionality reduction or feature selection can increase the ESS, and argue that our results support dimensionality reduction before covariate shift adaptation as good practice.
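    The ESS at the heart of the argument is easy to compute. The sketch below uses exact importance weights w_i = p_test(x_i)/p_train(x_i) for two Gaussians whose means differ by 0.2 per coordinate, and shows the ESS shrinking as the dimension grows; the distributions are illustrative.
    ```python
    # Sketch: ESS = (sum w)^2 / sum w^2 under covariate shift between
    # N(0, I) (train) and N(0.2 * 1, I) (test), for growing dimension d.
    import numpy as np
    from scipy.stats import multivariate_normal as mvn

    rng = np.random.default_rng(5)
    for d in (2, 10, 50):
        X = rng.standard_normal((1000, d))                  # train samples
        shift = np.full(d, 0.2)                             # test mean
        w = mvn(mean=shift).pdf(X) / mvn(mean=np.zeros(d)).pdf(X)
        ess = w.sum() ** 2 / (w ** 2).sum()
        print(f"d={d:3d}: ESS = {ess:.0f} / 1000")
    ```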
    The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. (arXiv:2201.03544v1 [cs.LG])
    (2 min) Reward hacking -- where RL agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied. To understand how reward hacking arises, we construct four RL environments with misspecified rewards. We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time. More capable agents often exploit reward misspecifications, achieving higher proxy reward and lower true reward than less capable agents. Moreover, we find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward. Such phase transitions pose challenges to monitoring the safety of ML systems. To address this, we propose an anomaly detection task for aberrant policies and offer several baseline detectors.
    A Survey of Uncertainty in Deep Neural Networks. (arXiv:2107.03342v2 [cs.LG] UPDATED)
    (2 min) As neural networks spread to more and more application areas, confidence in their predictions has become increasingly important. However, basic neural networks do not deliver certainty estimates and may suffer from over- or under-confidence. Many researchers have been working on understanding and quantifying uncertainty in a neural network's prediction. As a result, different types and sources of uncertainty have been identified and a variety of approaches to measure and quantify uncertainty in neural networks have been proposed. This work gives a comprehensive overview of uncertainty estimation in neural networks, reviews recent advances in the field, highlights current challenges, and identifies potential research opportunities. It is intended to give anyone interested in uncertainty estimation in neural networks a broad overview and introduction, without presupposing prior knowledge in this field. A comprehensive introduction to the most crucial sources of uncertainty is given and their separation into reducible model uncertainty and irreducible data uncertainty is presented. The modeling of these uncertainties based on deterministic neural networks, Bayesian neural networks, ensembles of neural networks, and test-time data augmentation approaches is introduced, and different branches of these fields as well as the latest developments are discussed. For practical application, we discuss different measures of uncertainty, approaches for calibrating neural networks, and give an overview of existing baselines and implementations. Different examples from the wide spectrum of challenges in different fields give an idea of the needs and challenges regarding uncertainties in practical applications. Additionally, the practical limitations of current methods for mission- and safety-critical real-world applications are discussed and an outlook on the next steps towards a broader usage of such methods is given.
    Optimal radial basis for density-based atomic representations. (arXiv:2105.08717v2 [stat.ML] UPDATED)
    (2 min) The input of almost every machine learning algorithm targeting the properties of matter at the atomic scale involves a transformation of the list of Cartesian atomic coordinates into a more symmetric representation. Many of the most popular representations can be seen as an expansion of the symmetrized correlations of the atom density, and differ mainly by the choice of basis. Considerable effort has been dedicated to the optimization of the basis set, typically driven by heuristic considerations on the behavior of the regression target. Here we take a different, unsupervised viewpoint, aiming to determine the basis that encodes in the most compact way possible the structural information that is relevant for the dataset at hand. For each training dataset and number of basis functions, one can determine a unique basis that is optimal in this sense, and can be computed at no additional cost with respect to the primitive basis by approximating it with splines. We demonstrate that this construction yields representations that are accurate and computationally efficient, particularly when constructing representations that correspond to high-body order correlations. We present examples that involve both molecular and condensed-phase machine-learning models.
    A Tutorial on Learning With Bayesian Networks. (arXiv:2002.00269v3 [cs.LG] UPDATED)
    (2 min) A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest. When used in conjunction with statistical techniques, the graphical model has several advantages for data analysis. One, because the model encodes dependencies among all variables, it readily handles situations where some data entries are missing. Two, a Bayesian network can be used to learn causal relationships, and hence can be used to gain understanding about a problem domain and to predict the consequences of intervention. Three, because the model has both a causal and probabilistic semantics, it is an ideal representation for combining prior knowledge (which often comes in causal form) and data. Four, Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for avoiding the overfitting of data. In this paper, we discuss methods for constructing Bayesian networks from prior knowledge and summarize Bayesian statistical methods for using data to improve these models. With regard to the latter task, we describe methods for learning both the parameters and structure of a Bayesian network, including techniques for learning with incomplete data. In addition, we relate Bayesian-network methods for learning to techniques for supervised and unsupervised learning. We illustrate the graphical-modeling approach using a real-world case study.
    Provable Clustering of a Union of Linear Manifolds Using Optimal Directions. (arXiv:2201.02745v1 [cs.LG])
    (2 min) This paper focuses on the Matrix Factorization based Clustering (MFC) method, which is one of the few closed-form algorithms for the subspace clustering problem. Despite being simple, closed-form, and computationally efficient, MFC can outperform other sophisticated subspace clustering methods in many challenging scenarios. We reveal the connection between MFC and the Innovation Pursuit (iPursuit) algorithm, which was shown to outperform other spectral clustering based methods by a notable margin, especially when the spans of the clusters are close. A novel theoretical study is presented which sheds light on the key performance factors of both algorithms (MFC/iPursuit), and it is shown that both algorithms can be robust to notable intersections between the spans of the clusters. Importantly, in contrast to the theoretical guarantees of other algorithms, which emphasized the distance between the subspaces as the key performance factor, and without making the innovation assumption, it is shown that the performance of MFC/iPursuit mainly depends on the distance between the innovative components of the clusters.
    Trade-offs between membership privacy & adversarially robust learning. (arXiv:2006.04622v2 [cs.LG] UPDATED)
    (2 min) Historically, machine learning methods have not been designed with security in mind. In turn, this has given rise to adversarial examples, carefully perturbed input samples aimed to mislead detection at test time, which have been applied to attack spam and malware classification, and more recently to attack image classification. Consequently, an abundance of research has been devoted to designing machine learning methods that are robust to adversarial examples. Unfortunately, there are desiderata besides robustness that a secure and safe machine learning model must satisfy, such as fairness and privacy. Recent work by Song et al. (2019) has shown, empirically, that there exists a trade-off between robust and private machine learning models. Models designed to be robust to adversarial examples often overfit on training data to a larger extent than standard (non-robust) models. If a dataset contains private information, then any statistical test that separates training and test data by observing a model's outputs can represent a privacy breach, and if a model overfits on training data, these statistical tests become easier. In this work, we identify settings where standard models will overfit to a larger extent in comparison to robust models, and as empirically observed in previous works, settings where the opposite behavior occurs. Thus, it is not necessarily the case that privacy must be sacrificed to achieve robustness. The degree of overfitting naturally depends on the amount of data available for training. We go on to characterize how the training set size factors into the privacy risks exposed by training a robust model on a simple Gaussian data task, and show empirically that our findings hold on image classification benchmark datasets, such as CIFAR-10 and CIFAR-100.
    Wiki-CS: A Wikipedia-Based Benchmark for Graph Neural Networks. (arXiv:2007.02901v2 [cs.LG] UPDATED)
    (2 min) We present Wiki-CS, a novel dataset derived from Wikipedia for benchmarking Graph Neural Networks. The dataset consists of nodes corresponding to Computer Science articles, with edges based on hyperlinks and 10 classes representing different branches of the field. We use the dataset to evaluate semi-supervised node classification and single-relation link prediction models. Our experiments show that these methods perform well on a new domain, with structural properties different from earlier benchmarks. The dataset is publicly available, along with the implementation of the data pipeline and the benchmark experiments, at https://github.com/pmernyei/wiki-cs-dataset .
    L\'evy Induced Stochastic Differential Equation Equipped with Neural Network for Time Series Forecasting. (arXiv:2111.13164v2 [cs.LG] UPDATED)
    (2 min) With the fast development of modern deep learning techniques, the studies of dynamical systems and neural networks increasingly benefit each other in many ways. Since uncertainties often arise in real-world observations, SDEs (stochastic differential equations) come to play an important role. To be more specific, in this paper we use a collection of SDEs equipped with neural networks to predict the long-term trend of noisy time series that exhibit large jumps and substantial shifts in their probability distribution. Our contributions are: first, we explored SDEs driven by $\alpha$-stable L\'evy motion to model the time series data and solved the problem through neural network approximation. Second, we theoretically proved the convergence of the model and obtained the convergence rate. Finally, we illustrated our method by applying it to stock market time series prediction and measured the empirical convergence order of the error.
    Subset Selection with Shrinkage: Sparse Linear Modeling when the SNR is low. (arXiv:1708.03288v4 [stat.ME] UPDATED)
    (2 min) We study a seemingly unexpected and relatively less understood overfitting aspect of a fundamental tool in sparse linear modeling - best subset selection, which minimizes the residual sum of squares subject to a constraint on the number of nonzero coefficients. While the best subset selection procedure is often perceived as the "gold standard" in sparse learning when the signal to noise ratio (SNR) is high, its predictive performance deteriorates when the SNR is low. In particular, it is outperformed by continuous shrinkage methods, such as ridge regression and the Lasso. We investigate the behavior of best subset selection in the high-noise regimes and propose an alternative approach based on a regularized version of the least-squares criterion. Our proposed estimators (a) mitigate, to a large extent, the poor predictive performance of best subset selection in the high-noise regimes; and (b) perform favorably, while generally delivering substantially sparser models, relative to the best predictive models available via ridge regression and the Lasso. We conduct an extensive theoretical analysis of the predictive properties of the proposed approach and provide justification for its superior predictive performance relative to best subset selection when the noise-level is high. Our estimators can be expressed as solutions to mixed integer second order conic optimization problems and, hence, are amenable to modern computational tools from mathematical optimization.
    Enhancing Haptic Distinguishability of Surface Materials with Boosting Technique. (arXiv:2010.02002v4 [cs.LG] UPDATED)
    (2 min) Discriminative features are crucial for several learning applications, such as object detection and classification. Neural networks are extensively used for extracting discriminative features of images and speech signals. However, the lack of large datasets in the haptics domain often limits the applicability of such techniques. This paper presents a general framework for the analysis of the discriminative properties of haptic signals. We demonstrate the effectiveness of spectral features and a boosted embedding technique in enhancing the distinguishability of haptic signals. Experiments indicate our framework needs less training data, generalizes well for different predictors, and outperforms the related state-of-the-art.
    Non-Asymptotic Guarantees for Robust Statistical Learning under $(1+\varepsilon)$-th Moment Assumption. (arXiv:2201.03182v1 [stat.ML])
    (2 min) There has been a surge of interest in developing robust estimators for models with heavy-tailed data in statistics and machine learning. This paper proposes a log-truncated M-estimator for a large family of statistical regressions and establishes its excess risk bound under the condition that the data have a $(1+\varepsilon)$-th moment with $\varepsilon \in (0,1]$. With an additional assumption on the associated risk function, we obtain an $\ell_2$-error bound for the estimation. Our theorems are applied to establish robust M-estimators for concrete regressions. Besides convex regressions such as quantile regression and generalized linear models, many non-convex regressions can also be fit into our theorems; we focus on robust deep neural network regressions, which can be solved by stochastic gradient descent algorithms. Simulations and real data analysis demonstrate the superiority of log-truncated estimations over standard estimations.
  • The Official NVIDIA Blog

    World Record-Setting DNA Sequencing Technique Helps Clinicians Rapidly Diagnose Critical Care Patients
    (4 min) Cutting down the time needed to sequence and analyze a patient’s whole genome from days to hours isn’t just about clinical efficiency — it can save lives. By accelerating every step of this process — from collecting a blood sample to sequencing the whole genome to identifying variants linked to diseases — a research team […]
    Elevated Entertainment: SHIELD Experience 9.0 Upgrade Rolling Out Now
    (3 min) SHIELD Software Experience Upgrade 9.0 is rolling out to all NVIDIA SHIELD TVs, delivering the Android 11 operating system and more. An updated Gboard — the Google Keyboard — allows people to use their voices and the Google Assistant to discover content in all search boxes. Additional permissions let users customize privacy across apps, including […]
  • AWS Machine Learning Blog

    Optimize your inference jobs using dynamic batch inference with TorchServe on Amazon SageMaker
    (8 min) In deep learning, batch processing refers to feeding multiple inputs into a model. Although it’s essential during training, it can be very helpful to manage the cost and optimize throughput during inference time as well. Hardware accelerators are optimized for parallelism, and batching helps saturate the compute capacity and often leads to higher throughput. Batching […]
    Graph-based recommendation system with Neptune ML: An illustration on social network link prediction challenges
    (13 min) Recommendation systems are one of the most widely adopted machine learning (ML) technologies in real-world applications, ranging from social networks to ecommerce platforms. Users of many online systems rely on recommendation systems to make new friendships, discover new music according to suggested music lists, or even make ecommerce purchase decisions based on the recommended products. […]
    Secure access to Amazon SageMaker Studio with AWS SSO and a SAML application
    (9 min) Cloud security at AWS is the highest priority. Amazon SageMaker Studio offers various mechanisms to protect your data and code using integration with AWS security services like AWS Identity and Access Management (IAM), AWS Key Management Service (AWS KMS), or network isolation with Amazon Virtual Private Cloud (Amazon VPC). Customers in highly regulated industries, like […]
  • Becoming Human: Artificial Intelligence Magazine - Medium

    Smart hospitals: the market overview, trends, and considerations
    (10 min) The pandemic has uncovered and amplified the struggles the healthcare sector is facing. From huge doctor workloads to unsatisfied patients…
    Transfer Learning — Part — 5.2!! Implementing ResNet in PyTorch
    (35 min) In Part 5.0 of the Transfer Learning series we discussed the pre-trained ResNet model in depth, so in this part we will implement…
    How to create a recommendation model with help of Machine Learning Techniques and model Ensembling
    (9 min) Sector: Agriculture, Project: “InteliCrop: An Ensemble Model to Predict Crop using Machine Learning Algorithms”
  • John D. Cook

    Queueing theory equations
    (2 min) A blog post about queueing theory that I wrote back in 2008 continues to be popular. The post shows that when a system is close to capacity, adding another server dramatically reduces the wait time. In that example, going from one teller to two tellers doesn’t make service twice as fast but 93 times as […]
    Swanson’s rule of thumb
    (2 min) Swanson’s rule of thumb [1] says that the mean of a moderately skewed probability distribution can be approximated by the weighted average of the 10th, 50th, and 90th percentiles, with weights 0.3, 0.4, and 0.3 respectively. Because it is based on percentiles, the rule is robust to outliers. Swanson’s rule is used in the oil […]
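    A quick numerical check of the rule on a moderately skewed lognormal (the sigma value is an arbitrary choice for illustration):
    ```python
    # Swanson's rule: mean ≈ 0.3*P10 + 0.4*P50 + 0.3*P90.
    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.lognormal(mean=0.0, sigma=0.5, size=1_000_000)

    p10, p50, p90 = np.percentile(x, [10, 50, 90])
    swanson = 0.3 * p10 + 0.4 * p50 + 0.3 * p90
    print(f"true mean ≈ {x.mean():.4f}, Swanson estimate ≈ {swanson:.4f}")
    # For sigma = 0.5 the two agree to within a percent or so.
    ```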
  • Microsoft Research

    EzPC: Increased data security in the AI model validation process
    (7 min) From manufacturing and logistics to agriculture and transportation, the expansion of artificial intelligence (AI) in the last decade has revolutionized a multitude of industries—examples include enhancing predictive analytics on the manufacturing floor and making microclimate predictions so that farmers can respond and save their crops in time. The adoption of AI is expected to accelerate […]
  • Artificial Intelligence

    Lior Cole Is the Model Combining Artificial Intelligence With Religion
    (1 min) submitted by /u/mr_j_b
    Artificial intelligence to influence top tech trends in major way in next five years
    (0 min) submitted by /u/mr_j_b
    How A.I. is set to evolve in 2022
    (0 min) submitted by /u/mr_j_b
    What is the state of AI? This is the question I try to answer on my blog monthly, hoping to provide valuable information and insights to our community and those outside the field.
    (1 min) submitted by /u/OnlyProggingForFun
    OneFlow v0.6.0 just came out![P]
    (0 min) submitted by /u/Just0by
    AI-made Art
    (0 min) submitted by /u/deepnskate
  • Machine Learning

    [D] Is AI research reaching saturation?
    (3 min) It feels like every topic is heavily researched. There are 100 new papers every day mentioning small improvements to previous architectures/models with some tweaking. Every task has near-100% accuracy. Is AI research reaching saturation? What do you guys work on, and how competitive is it in your area? How do you keep up with the several papers coming in every day? submitted by /u/Bibbidi_Babbidi_Boo
    [D] Classical Clustering Benchmarks
    (1 min) Besides MNIST, what datasets are generally standard to test a method on? I am looking to test a non-deep clustering method on various benchmarks. When I look at papers accepted into top ML conferences like NeurIPS, ICML and ICLR, there doesn't appear to be a consistent set of benchmarks. Is there a standard set of clustering benchmarks one is supposed to deploy new methods on? (I would not expect the method I'm testing to perform well on, e.g., CIFAR-10; my impression is that almost all non-deep methods that aren't custom-tuned to images perform poorly on CIFAR-10.) submitted by /u/Grand_Distribution83
    [D] [R] large query set few shot learning
    (1 min) Is there a research field specific to image classification/embedding learning where we have a large number of classes (2k+) and only a few samples per class (~1-5)? What I noticed in few-shot learning papers is that the query set is always very small, 20-way 1-shot or something similar. Is there any paper you have seen that I can check? Thank you! submitted by /u/flow_smith
    [R] ConvNets vs Transformers
    (4 min) A ConvNet for the 2020s - a nice read to start 2022. The authors explore modernizations of ResNets and adopt some tricks from transformer training design to make ConvNets great again. There is a lot to reflect on and think about. Code is here. submitted by /u/AdelSexy
    [D] How good are Australian Universities for Deep Learning/AI?
    (1 min) Hey. I was thinking of pursuing my Masters in Australia but am still confused about to what extent they pursue deep learning. For example, I was going through the University of Adelaide and found they have a dedicated Computer Vision lab, among others, which has also produced some good SOTA results in the past. But apart from that, what other universities would the community recommend? Or are there certain professors who could be of interest? Any suggestions are welcome :) submitted by /u/Unitrix247
    [D] GAN training initially degrades results of pre-trained generator
(1 min) I have an issue with the training of a GAN, which consists of a generator and two discriminators. The generator is used to generate waveforms. 1 - The generator is independently pre-trained by regression, up to 400k steps. 2 - The two randomly initialized discriminators are then activated, and GAN training takes place. However, step 2 initially degrades the output of the generator by removing too much information. The orange curve represents the l1-norm between input and output: the l1-norm steeply rises for the first few validation steps. In the blue curve, I tried freezing the generator for 80k steps to allow pre-training of the discriminators, but the problem persists. This is also reflected in all the validation losses (see image above), whose values get worse. The GAN is based on a variation of the code below: https://github.com/rishikksh20/hifigan-denoiser/blob/master/train.py Any suggestions? submitted by /u/alf_Lafleur [link] [comments]
    [P] Tutorial: Time Series Cross-Validation
(2 min) [Tutorial] - Time Series Cross-Validation TL;DR: The code is here. Time Series Forecasting can be overwhelming, especially if you are just getting started. There are many different types of Time Series tasks; each differs by the number of input or output sequences, the number of steps to predict, whether the input and/or the output sequence length is static or changing, and so on. In this notebook, we will experiment with different types of Time Series Cross-Validation strategies in order to become familiar with them and understand which works best for what case. As written before, Time Series problems can be of different variations, so in order to get a deeper understanding, we should explore each. Variation I: Number of Input / Output sequences (1.0) Single input and single…
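    For readers who want to try one such strategy immediately, scikit-learn's expanding-window splitter is a reasonable starting point; a minimal sketch (the series below is synthetic, the notebook's own data is not reproduced here):

    ```python
    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    # Synthetic univariate series standing in for the tutorial's data.
    y = np.arange(100, dtype=float)
    X = y.reshape(-1, 1)

    # Expanding-window CV: each fold trains on everything up to a cutoff
    # and validates on the block that follows, so the future never leaks
    # backwards into training.
    tscv = TimeSeriesSplit(n_splits=5)
    for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
        print(f"fold {fold}: train [0..{train_idx[-1]}], "
              f"val [{val_idx[0]}..{val_idx[-1]}]")
    ```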
    [D] Faster Optimization using GPU Multiprocessing
(1 min) GPU Multiprocessing for Parameter Optimization TL;DR: The code is here. Hi guys, I published a notebook using a trick for running multiprocessing on GPU devices. Might be interesting for some here, so have fun. As we know, the GPU makes everything faster. Moreover, oftentimes the GPU device we use is so powerful that our code doesn't utilize it to 100% of its potential. In such cases, running multiple GPU instances in parallel can come in handy and save us some extra time. I've made a notebook that introduces a method for running parallel jobs on GPU devices. It is a bit tricky since most packages that support GPU usage aren't built from the ground up for such use cases. The "trick" is simple: we simply do not call the GPU in the main process; the first call to any GPU utilization should come from the child processes. To demonstrate the power of this concept, I made a notebook that runs parallel hyperparameter optimization on the GPU for LightGBM. The GPU enables us to train our models faster (much faster), and by combining that with parallel execution we get to search a large space of parameter configurations to fully maximize our model's performance, since we can now try many combinations. The optimization framework is Optuna, the popular framework for optimizing hyperparameters. Again, this is just an example of the approach. You can use the same approach for many other GPU-enabled use cases. Personally, this concept has helped me over and over, as it is easily transferable to many different GPU-based models including neural networks (TensorFlow, PyTorch...), other GBM models (CatBoost, XGBoost...), feature engineering (Pandas, CuPy), and more. submitted by /u/yamqwe [link] [comments]
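    A minimal sketch of the trick as described: the parent process never touches the GPU, and each child is the first to trigger any GPU work while sharing one Optuna study. The SQLite storage path and the stand-in objective are illustrative assumptions, not the notebook's code:

    ```python
    import multiprocessing as mp
    import optuna

    STORAGE = "sqlite:///hpo.db"   # assumed shared storage; any RDB URL works
    STUDY = "gpu_parallel_demo"    # hypothetical study name

    def objective(trial):
        # In the notebook this would be a GPU LightGBM training run; the key
        # point is that the *first* GPU call happens inside the child process.
        lr = trial.suggest_float("learning_rate", 1e-3, 0.3, log=True)
        leaves = trial.suggest_int("num_leaves", 15, 255)
        return (lr - 0.1) ** 2 + (leaves - 63) ** 2  # stand-in for a CV score

    def worker(n_trials):
        study = optuna.load_study(study_name=STUDY, storage=STORAGE)
        study.optimize(objective, n_trials=n_trials)

    if __name__ == "__main__":
        mp.set_start_method("spawn", force=True)  # safest with CUDA contexts
        optuna.create_study(study_name=STUDY, storage=STORAGE,
                            direction="minimize", load_if_exists=True)
        # No GPU work has happened yet in the parent process.
        procs = [mp.Process(target=worker, args=(25,)) for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(optuna.load_study(study_name=STUDY, storage=STORAGE).best_params)
    ```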
    [R] Julia developers discuss the current state of ML tools in Julia compared to Python
    (1 min) https://discourse.julialang.org/t/state-of-machine-learning-in-julia/74385/18 The developers of some of the largest Julia language packages discuss the current state of ML in Julia, and compare and contrast its status with the Python ML ecosystem. submitted by /u/kdfn [link] [comments]
    [D] Question to ML developers: Do you split your work with a programmer?
(1 min) That is, you do the theory (maths and …) and hand it over to a programmer to implement your ideas. Is it worth your time? submitted by /u/Secure_Pomegranate10 [link] [comments]
    [R] Automated Reinforcement Learning (AutoRL): A Survey and Open Problems
(1 min) Paper: https://arxiv.org/abs/2201.03916. First survey on the recently developing field of AutoRL. Abstract: The combination of Reinforcement Learning (RL) with deep learning has led to a series of impressive feats, with many believing (deep) RL provides a path towards generally capable agents. However, the success of RL agents is often highly sensitive to design choices in the training process, which may require tedious and error-prone manual tuning. This makes it challenging to use RL for new problems, while also limiting its full potential. In many other areas of machine learning, AutoML has shown that it is possible to automate such design choices, and it has also yielded promising initial results when applied to RL. However, Automated Reinforcement Learning (AutoRL) involves not only standard applications of AutoML but also additional challenges unique to RL that naturally produce a different set of methods. As such, AutoRL has been emerging as an important area of research in RL, providing promise in a variety of applications from RNA design to playing games such as Go. Given the diversity of methods and environments considered in RL, much of the research has been conducted in distinct subfields, ranging from meta-learning to evolution. In this survey we seek to unify the field of AutoRL: we provide a common taxonomy, discuss each area in detail and pose open problems of interest to researchers going forward. submitted by /u/machinelearner5000 [link] [comments]
    [D] Edit Videos With CLIP - StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2 by Ivan Skorokhodov et al. explained in 5 minutes (by Casual GAN Papers)
(1 min) While we have seen several new SOTA image generation models pop up over the last year, video generation still remains lackluster, to say the least. But does it have to be? The authors of StyleGAN-V certainly don't think so! By adapting the generator from StyleGAN2 to work with motion conditions, developing a hypernetwork-based discriminator, and designing a clever acyclic positional encoding, Ivan Skorokhodov and the team at KAUST and Snap Inc. deliver a model that generates videos of arbitrary length with arbitrary framerate, is just 5% more expensive to train than a vanilla StyleGAN2, and beats multiple baseline models at 256 and 1024 resolution. Oh, and it only needs to see about 2 frames from a video during training to do so! And if that wasn't impressive enough, StyleGAN-V is CLIP-compatible for the first-ever text-based consistent video editing. Full summary: https://t.me/casual_gan/238 Blog post: https://www.casualganpapers.com/text_guided_video_editing_hd_video_generation/StyleGAN-V-explained.html StyleGAN-V: generate HD videos and edit them with CLIP arxiv / code (coming soon) Subscribe to Casual GAN Papers and follow me on Twitter for weekly AI paper summaries! submitted by /u/KirillTheMunchKing [link] [comments]
    [D] Real-time Public data API
(0 min) I am testing a feature related to data pipelines. To produce a more realistic demo I started searching for dynamic/real-time public data that I can build an end-to-end project around. Best if it is finance-related data: a stock limit order book, etc. submitted by /u/Late-trip-to-cabo [link] [comments]
  • Reinforcement Learning

    Roman Ring (DeepMind) talks StarCraft, AlphaStar on TalkRL?
    (1 min) submitted by /u/Infamous-Editor5131 [link] [comments]
    Need explanation of some basic code in Spinningup : Spinup utils
(1 min) Hi. I have a basic query regarding code in the spinup utils. In the file test_policy.py, L87: get_action = lambda x : sess.run(action_op, feed_dict={model['x']: x[None,:]})[0] What does the 'x[None,:]' part do/mean? Sorry, I am a beginner and this might be a trivial question. Would appreciate any possible help. submitted by /u/underconfidant_soul [link] [comments]
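    For what it's worth, `x[None, :]` is plain NumPy indexing: `None` (an alias for `np.newaxis`) inserts a new leading axis, turning a single observation of shape `(obs_dim,)` into a batch of one with shape `(1, obs_dim)`, which is what the TensorFlow placeholder expects. A quick self-contained check:

    ```python
    import numpy as np

    x = np.array([0.1, -0.3, 0.7])   # a single observation, shape (3,)
    print(x.shape)                   # (3,)
    print(x[None, :].shape)          # (1, 3): a batch containing one observation
    ```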
Are neural networks necessary to update actor-critic model parameters? Is there another method that performs the same role?
    (1 min) submitted by /u/aabra__ka__daabra [link] [comments]
    Is Vanilla Policy Gradient algorithm better than Advantage Actor Critic (A2C) ?
(1 min) Hi! I've been studying Policy Gradient algorithms and implementing them, and I've found that with the same amount of data (~5000 samples in total over multiple episodes, 50 "epochs") the simple and basic VPG algorithm with a "reward-to-go" function converges to an average return of 200 on the CartPole-v0 env much, MUCH faster than A2C. Is that how it's supposed to be? Shouldn't the A2C version (with less variance, as the authors claim) converge better? Edit: I've just found a bug in my code, which has improved my results, but still only made them comparable to simple VPG. Edit 2: I apologize for even asking the question... It seems that playing around with hyperparameters can change the situation drastically. submitted by /u/Decent-Ad9135 [link] [comments]
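    For readers unfamiliar with the term, "reward-to-go" weights each action by the (discounted) return from that timestep onward rather than by the whole-episode return; a minimal NumPy sketch:

    ```python
    import numpy as np

    def rewards_to_go(rewards, gamma=0.99):
        """R_t = r_t + gamma * r_{t+1} + ..., computed with one backward pass."""
        rtg = np.zeros(len(rewards), dtype=float)
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            rtg[t] = running
        return rtg

    print(rewards_to_go(np.ones(5), gamma=0.9))  # [4.0951 3.439 2.71 1.9 1.]
    ```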
    Ways for representing environments
(1 min) Hello there :D, I am currently working on an RL environment in which an agent (robot/drone) moves around looking for a target. Currently I use PPO with convolutional nets (CNN), where the observations are maps of the area (something like grid world) showing where the walls, agents' positions, etc. are. At each timestep the agent collects a stack of maps (3-4) that are fed to the network. I want to move to using an LSTM, and I was wondering what other ways I can use to model the observations. Keep in mind my observations should contain agents' positions (an agent observes its own location and the locations of other agents) and obstacles in the environment (walls determined by position and dimensions). Is there a better way of modeling these than a CNN? I thought of using a normal feed-forward network and just feeding positions directly, using embeddings for the walls, but I'm not sure how scalable that would be for an increasing number of agents (more agents = more input), while with a CNN the map dimensions stay the same and just more pixels are set to 1, indicating the agents' locations (see the sketch below). Any insights? submitted by /u/AhmedNizam_ [link] [comments]
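    One point in favor of the map encoding described in the post: the observation tensor's shape is independent of the agent count, since each entity type just lights up cells in its own channel. A minimal sketch (the entity lists and layout are hypothetical):

    ```python
    import numpy as np

    def grid_observation(size, walls, own_pos, other_agents):
        """Stack one binary channel per entity type: walls, self, others.

        Adding agents only flips more cells to 1 in channel 2; the tensor
        shape (3, size, size) fed to the CNN never changes.
        """
        obs = np.zeros((3, size, size), dtype=np.float32)
        for r, c in walls:
            obs[0, r, c] = 1.0
        obs[1, own_pos[0], own_pos[1]] = 1.0
        for r, c in other_agents:
            obs[2, r, c] = 1.0
        return obs

    obs = grid_observation(8, walls=[(0, i) for i in range(8)],
                           own_pos=(4, 4), other_agents=[(2, 6), (6, 1)])
    print(obs.shape)  # (3, 8, 8)
    ```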
    [D] Interview - This Team won the Minecraft RL BASALT Challenge! (Paper Explanation & Interview with the authors)
    (1 min) submitted by /u/gwern [link] [comments]

2022-01-11

  • cs.LG updates on arXiv.org

    Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. (arXiv:2106.13008v5 [cs.LG] UPDATED)
    (2 min) Extending the forecasting time is a critical demand for real applications, such as extreme weather early warning and long-term energy consumption planning. This paper studies the long-term forecasting problem of time series. Prior Transformer-based models adopt various self-attention mechanisms to discover the long-range dependencies. However, intricate temporal patterns of the long-term future prohibit the model from finding reliable dependencies. Also, Transformers have to adopt the sparse versions of point-wise self-attentions for long series efficiency, resulting in the information utilization bottleneck. Going beyond Transformers, we design Autoformer as a novel decomposition architecture with an Auto-Correlation mechanism. We break with the pre-processing convention of series decomposition and renovate it as a basic inner block of deep models. This design empowers Autoformer with progressive decomposition capacities for complex time series. Further, inspired by the stochastic process theory, we design the Auto-Correlation mechanism based on the series periodicity, which conducts the dependencies discovery and representation aggregation at the sub-series level. Auto-Correlation outperforms self-attention in both efficiency and accuracy. In long-term forecasting, Autoformer yields state-of-the-art accuracy, with a 38% relative improvement on six benchmarks, covering five practical applications: energy, traffic, economics, weather and disease. Code is available at this repository: \url{https://github.com/thuml/Autoformer}.
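    The decomposition block at the heart of this design is easy to picture: a moving average extracts the trend and the residual is treated as the seasonal part. A minimal NumPy sketch of that idea (the window size and toy series are illustrative choices, not the paper's setting):

    ```python
    import numpy as np

    def decompose(x, window=25):
        """Split a series into trend (moving average) and seasonal (residual)."""
        pad = window // 2
        xp = np.pad(x, (pad, pad), mode="edge")   # replicate the endpoints
        kernel = np.ones(window) / window
        trend = np.convolve(xp, kernel, mode="valid")[: len(x)]
        seasonal = x - trend
        return trend, seasonal

    t = np.arange(200, dtype=float)
    x = 0.05 * t + np.sin(2 * np.pi * t / 24)     # linear trend + daily period
    trend, seasonal = decompose(x)
    ```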
    Time Series Anomaly Detection for Cyber-Physical Systems via Neural System Identification and Bayesian Filtering. (arXiv:2106.07992v2 [cs.LG] UPDATED)
    (2 min) Recent advances in AIoT technologies have led to an increasing popularity of utilizing machine learning algorithms to detect operational failures for cyber-physical systems (CPS). In its basic form, an anomaly detection module monitors the sensor measurements and actuator states from the physical plant, and detects anomalies in these measurements to identify abnormal operation status. Nevertheless, building effective anomaly detection models for CPS is rather challenging as the model has to accurately detect anomalies in presence of highly complicated system dynamics and unknown amount of sensor noise. In this work, we propose a novel time series anomaly detection method called Neural System Identification and Bayesian Filtering (NSIBF) in which a specially crafted neural network architecture is posed for system identification, i.e., capturing the dynamics of CPS in a dynamical state-space model; then a Bayesian filtering algorithm is naturally applied on top of the "identified" state-space model for robust anomaly detection by tracking the uncertainty of the hidden state of the system recursively over time. We provide qualitative as well as quantitative experiments with the proposed method on a synthetic and three real-world CPS datasets, showing that NSIBF compares favorably to the state-of-the-art methods with considerable improvements on anomaly detection in CPS.
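    NSIBF itself is not reproducible from the abstract alone, but the underlying idea (track the hidden state's uncertainty recursively, then flag measurements whose innovation is improbable) can be illustrated with a scalar Kalman filter; all parameters below are illustrative assumptions:

    ```python
    import numpy as np

    def kalman_anomaly_scores(z, a=1.0, q=0.01, r=0.1):
        """Scalar Kalman filter; score each measurement by its normalized innovation."""
        x, p = 0.0, 1.0                          # state estimate and its variance
        scores = []
        for zt in z:
            x, p = a * x, a * a * p + q          # predict
            s = p + r                            # innovation variance
            scores.append(abs(zt - x) / np.sqrt(s))
            k = p / s                            # update
            x, p = x + k * (zt - x), (1 - k) * p
        return np.array(scores)

    rng = np.random.default_rng(0)
    z = rng.normal(0, 0.3, 300)
    z[150] += 4.0                                # inject one anomaly
    print(np.argmax(kalman_anomaly_scores(z)))   # 150
    ```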
    SpinalNet: Deep Neural Network with Gradual Input. (arXiv:2007.03347v3 [cs.CV] UPDATED)
(2 min) Deep neural networks (DNNs) have achieved state-of-the-art performance in numerous fields. However, DNNs need high computation times, and people always expect better performance at lower computational cost. Therefore, we study the human somatosensory system and design a neural network (SpinalNet) to achieve higher accuracy with fewer computations. Hidden layers in traditional NNs receive inputs from the previous layer, apply an activation function, and then transfer the outcomes to the next layer. In the proposed SpinalNet, each layer is split into three splits: 1) input split, 2) intermediate split, and 3) output split. The input split of each layer receives a part of the inputs. The intermediate split of each layer receives outputs of the intermediate split of the previous layer and outputs of the input split of the current layer. The number of incoming weights becomes significantly lower than in traditional DNNs. SpinalNet can also be used as the fully connected or classification layer of a DNN and supports both traditional learning and transfer learning. We observe significant error reductions with lower computational costs in most of the DNNs. Traditional learning on the VGG-5 network with SpinalNet classification layers provided state-of-the-art (SOTA) performance on the QMNIST, Kuzushiji-MNIST, and EMNIST (Letters, Digits, and Balanced) datasets. Traditional learning with ImageNet pre-trained initial weights and SpinalNet classification layers provided SOTA performance on the STL-10, Fruits 360, Bird225, and Caltech-101 datasets. The scripts of the proposed SpinalNet are available at the following link: https://github.com/dipuk0506/SpinalNet
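    A rough PyTorch sketch of the splitting described above, in which each hidden layer sees only a slice of the input concatenated with the previous layer's output; the widths, depth, and alternating input halves are illustrative choices, not the paper's configuration:

    ```python
    import torch
    import torch.nn as nn

    class SpinalBlock(nn.Module):
        """Each layer receives a part of the input plus the previous layer's
        output, so every weight matrix stays small."""
        def __init__(self, in_features=256, half=128, width=20, layers=4, classes=10):
            super().__init__()
            self.half = half
            self.fcs = nn.ModuleList(
                [nn.Linear(half, width)] +
                [nn.Linear(half + width, width) for _ in range(layers - 1)])
            self.out = nn.Linear(width * layers, classes)

        def forward(self, x):
            outs, h = [], None
            for i, fc in enumerate(self.fcs):
                part = x[:, : self.half] if i % 2 == 0 else x[:, self.half :]
                inp = part if h is None else torch.cat([part, h], dim=1)
                h = torch.relu(fc(inp))
                outs.append(h)
            return self.out(torch.cat(outs, dim=1))

    print(SpinalBlock()(torch.randn(2, 256)).shape)  # torch.Size([2, 10])
    ```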
    Collaborating with Humans without Human Data. (arXiv:2110.08176v2 [cs.LG] UPDATED)
(2 min) Collaborating with humans requires rapidly adapting to their individual strengths, weaknesses, and preferences. Unfortunately, most standard multi-agent reinforcement learning techniques, such as self-play (SP) or population play (PP), produce agents that overfit to their training partners and do not generalize well to humans. Alternatively, researchers can collect human data, train a human model using behavioral cloning, and then use that model to train "human-aware" agents ("behavioral cloning play", or BCP). While such an approach can improve the generalization of agents to new human co-players, it involves the onerous and expensive step of collecting large amounts of human data first. Here, we study the problem of how to train agents that collaborate well with human partners without using human data. We argue that the crux of the problem is to produce a diverse set of training partners. Drawing inspiration from successful multi-agent approaches in competitive domains, we find that a surprisingly simple approach is highly effective. We train our agent partner as the best response to a population of self-play agents and their past checkpoints taken throughout training, a method we call Fictitious Co-Play (FCP). Our experiments focus on a two-player collaborative cooking simulator that has recently been proposed as a challenge problem for coordination with humans. We find that FCP agents score significantly higher than SP, PP, and BCP when paired with novel agent and human partners. Furthermore, humans also report a strong subjective preference for partnering with FCP agents over all baselines.
    Machine-learning-based arc selection for constrained shortest path problems in column generation. (arXiv:2201.02535v1 [math.OC])
    (2 min) Column generation is an iterative method used to solve a variety of optimization problems. It decomposes the problem into two parts: a master problem, and one or more pricing problems (PP). The total computing time taken by the method is divided between these two parts. In routing or scheduling applications, the problems are mostly defined on a network, and the PP is usually an NP-hard shortest path problem with resource constraints. In this work, we propose a new heuristic pricing algorithm based on machine learning. By taking advantage of the data collected during previous executions, the objective is to reduce the size of the network and accelerate the PP, keeping only the arcs that have a high chance to be part of the linear relaxation solution. The method has been applied to two specific problems: the vehicle and crew scheduling problem in public transit and the vehicle routing problem with time windows. Reductions in computational time of up to 40% can be obtained.
    Community recovery in non-binary and temporal stochastic block models. (arXiv:2008.04790v3 [math.ST] UPDATED)
(2 min) This article studies the estimation of community memberships from non-binary pair interactions represented by an $N$-by-$N$ tensor whose values are elements of $\mathcal S$, where $N$ is the number of nodes and $\mathcal S$ is the space of the pairwise interactions between the nodes. As an information-theoretic benchmark, we study data sets generated by a non-binary stochastic block model, and derive fundamental information criteria for the recovery of the community memberships as $N \to \infty$. Examples of applications include weighted networks ($\mathcal S = \mathbb R$), link-labeled networks ($\mathcal S = \{0, 1, \dots, L\}$), multiplex networks ($\mathcal S = \{0,1\}^M$) and temporal networks ($\mathcal S = \{0,1\}^T$). For temporal interactions, we show that (i) even a small increase in $T$ may have a big impact on the recovery of community memberships, (ii) consistent recovery is possible even for very sparse data (e.g., bounded average degree) when $T$ is large enough. We also present several estimation algorithms, both offline and online, which fully utilise the temporal nature of the observed data. We analyse the accuracy of the proposed estimation algorithms under various assumptions on data sparsity and identifiability. Numerical experiments show that even a poor initial estimate (e.g., blind random guess) of the community assignment leads to high accuracy obtained by the online algorithm after a small number of iterations, and remarkably so also in very sparse regimes.
    Acoustic Neighbor Embeddings. (arXiv:2007.10329v5 [eess.AS] UPDATED)
(2 min) This paper proposes a novel acoustic word embedding called Acoustic Neighbor Embeddings where speech or text of arbitrary length are mapped to a vector space of fixed, reduced dimensions by adapting stochastic neighbor embedding (SNE) to sequential inputs. The Euclidean distance between coordinates in the embedding space reflects the phonetic confusability between their corresponding sequences. Two encoder neural networks are trained: an acoustic encoder that accepts speech signals in the form of frame-wise subword posterior probabilities obtained from an acoustic model and a text encoder that accepts text in the form of subword transcriptions. Compared to a triplet loss criterion, the proposed method is shown to have more effective gradients for neural network training. Experimentally, it also gives more accurate results with low-dimensional embeddings when the two encoder networks are used in tandem in a word (name) recognition task, and when the text encoder network is used standalone in an approximate phonetic matching task. In particular, in an isolated name recognition task depending solely on Euclidean nearest-neighbor search between the proposed embedding vectors, the recognition accuracy is identical to that of conventional finite state transducer (FST)-based decoding using test data with up to 1 million names in the vocabulary and 40 dimensions in the embeddings.
    Deep Reinforcement Learning. (arXiv:2201.02135v1 [cs.AI] CROSS LISTED)
(2 min) Deep reinforcement learning has gathered much attention recently. Impressive results were achieved in activities as diverse as autonomous driving, game playing, molecular recombination, and robotics. In all these fields, computer programs have taught themselves to solve difficult problems. They have learned to fly model helicopters and perform aerobatic manoeuvres such as loops and rolls. In some applications they have even become better than the best humans, such as in Atari, Go, poker and StarCraft. The way in which deep reinforcement learning explores complex environments reminds us of how children learn, by playfully trying out things, getting feedback, and trying again. The computer seems to truly possess aspects of human learning; this goes to the heart of the dream of artificial intelligence. The successes in research have not gone unnoticed by educators, and universities have started to offer courses on the subject. The aim of this book is to provide a comprehensive overview of the field of deep reinforcement learning. The book is written for graduate students of artificial intelligence, and for researchers and practitioners who wish to better understand deep reinforcement learning methods and their challenges. We assume an undergraduate level of understanding of computer science and artificial intelligence; the programming language of this book is Python. We describe the foundations, the algorithms and the applications of deep reinforcement learning. We cover the established model-free and model-based methods that form the basis of the field. Developments go quickly, and we also cover advanced topics: deep multi-agent reinforcement learning, deep hierarchical reinforcement learning, and deep meta learning.
    Weighted Graph-Based Signal Temporal Logic Inference Using Neural Networks. (arXiv:2109.08078v2 [cs.AI] UPDATED)
(2 min) Extracting spatial-temporal knowledge from data is useful in many applications. It is important that the obtained knowledge is human-interpretable and amenable to formal analysis. In this paper, we propose a method that trains neural networks to learn spatial-temporal properties in the form of weighted graph-based signal temporal logic (wGSTL) formulas. For learning wGSTL formulas, we introduce a flexible wGSTL formula structure in which the user's preference can be applied in the inferred wGSTL formulas. In the proposed framework, each neuron of the neural networks corresponds to a subformula in a flexible wGSTL formula structure. We initially train a neural network to learn the wGSTL operators and then train a second neural network to learn the parameters in a flexible wGSTL formula structure. We use a COVID-19 dataset and a rain prediction dataset to evaluate the performance of the proposed framework and algorithms. We compare the performance of the proposed framework with baseline classification methods, including K-nearest neighbors, decision trees, support vector machines, and artificial neural networks. The classification accuracy obtained by the proposed framework is comparable with the baseline classification methods.
    AugmentedPCA: A Python Package of Supervised and Adversarial Linear Factor Models. (arXiv:2201.02547v1 [stat.ML])
(2 min) Deep autoencoders are often extended with a supervised or adversarial loss to learn latent representations with desirable properties, such as greater predictivity of labels and outcomes or fairness with respect to a sensitive variable. Despite the ubiquity of supervised and adversarial deep latent factor models, these methods should demonstrate improvement over simpler linear approaches to be preferred in practice. This necessitates a reproducible linear analog that still adheres to an augmenting supervised or adversarial objective. We address this methodological gap by presenting methods that augment the principal component analysis (PCA) objective with either a supervised or an adversarial objective and provide analytic and reproducible solutions. We implement these methods in an open-source Python package, AugmentedPCA, that can produce excellent real-world baselines. We demonstrate the utility of these factor models on an open-source, RNA-seq cancer gene expression dataset, showing that augmenting with a supervised objective results in improved downstream classification performance, produces principal components with greater class fidelity, and facilitates identification of genes aligned with the principal axes of data variance with implications to development of specific types of cancer.
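    The package's own API is not quoted in the abstract, so here is only a generic sketch of the idea of augmenting the PCA objective with a supervised term: take the leading eigenvectors of $X^T X + \mu X^T Y Y^T X$, so that directions are rewarded both for variance and for alignment with the labels. The weight $\mu$ and the toy data are assumptions, and this is not the AugmentedPCA package's solution:

    ```python
    import numpy as np

    def supervised_pca(X, Y, k=2, mu=10.0):
        """Top-k directions of a variance term plus a label-alignment term.

        A generic linear supervised-PCA construction, not the AugmentedPCA
        package's own analytic solution."""
        Xc = X - X.mean(0)
        Yc = Y - Y.mean(0)
        M = Xc.T @ Xc + mu * (Xc.T @ Yc) @ (Yc.T @ Xc)  # symmetric by construction
        vals, vecs = np.linalg.eigh(M)                  # eigenvalues ascending
        return vecs[:, ::-1][:, :k]                     # top-k components

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 8))
    Y = X[:, :1] + 0.1 * rng.normal(size=(200, 1))      # labels tied to feature 0
    W = supervised_pca(X, Y)
    print(W.shape)  # (8, 2)
    ```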
    Quantum Unsupervised and Supervised Learning on Superconducting Processors. (arXiv:1909.04226v2 [quant-ph] UPDATED)
    (2 min) Machine learning algorithms perform well on identifying patterns in many different datasets due to their versatility. However, as one increases the size of the dataset, the computation time for training and using these statistical models grows quickly. Quantum computing offers a new paradigm which may have the ability to overcome these computational difficulties. Here, we propose a quantum analogue to K-means clustering, implement it on simulated superconducting qubits, and compare it to a previously developed quantum support vector machine. We find the algorithm's accuracy comparable to the classical K-means algorithm for clustering and classification problems, and find that it has asymptotic complexity $O(N^{3/2}K^{1/2}\log{P})$, where $N$ is the number of data points, $K$ is the number of clusters, and $P$ is the dimension of the data points, giving a significant speedup over the classical analogue.
    E-GraphSAGE: A Graph Neural Network based Intrusion Detection System for IoT. (arXiv:2103.16329v7 [cs.NI] UPDATED)
    (2 min) This paper presents a new Network Intrusion Detection System (NIDS) based on Graph Neural Networks (GNNs). GNNs are a relatively new sub-field of deep neural networks, which can leverage the inherent structure of graph-based data. Training and evaluation data for NIDSs are typically represented as flow records, which can naturally be represented in a graph format. In this paper, we propose E-GraphSAGE, a GNN approach that allows capturing both the edge features of a graph as well as the topological information for network intrusion detection in IoT networks. To the best of our knowledge, our proposal is the first successful, practical, and extensively evaluated approach of applying GNNs on the problem of network intrusion detection for IoT using flow-based data. Our extensive experimental evaluation on four recent NIDS benchmark datasets shows that our approach outperforms the state-of-the-art in terms of key classification metrics, which demonstrates the potential of GNNs in network intrusion detection, and provides motivation for further research.
    Exploratory Lagrangian-Based Particle Tracing Using Deep Learning. (arXiv:2110.08338v2 [cs.LG] UPDATED)
    (2 min) Time-varying vector fields produced by computational fluid dynamics simulations are often prohibitively large and pose challenges for accurate interactive analysis and exploration. To address these challenges, reduced Lagrangian representations have been increasingly researched as a means to improve scientific time-varying vector field exploration capabilities. This paper presents a novel deep neural network-based particle tracing method to explore time-varying vector fields represented by Lagrangian flow maps. In our workflow, in situ processing is first utilized to extract Lagrangian flow maps, and deep neural networks then use the extracted data to learn flow field behavior. Using a trained model to predict new particle trajectories offers a fixed small memory footprint and fast inference. To demonstrate and evaluate the proposed method, we perform an in-depth study of performance using a well-known analytical data set, the Double Gyre. Our study considers two flow map extraction strategies as well as the impact of the number of training samples and integration durations on efficacy, evaluates multiple sampling options for training and testing and informs hyperparameter settings. Overall, we find our method requires a fixed memory footprint of 10.5 MB to encode a Lagrangian representation of a time-varying vector field while maintaining accuracy. For post hoc analysis, loading the trained model costs only two seconds, significantly reducing the burden of I/O when reading data for visualization. Moreover, our parallel implementation can infer one hundred locations for each of two thousand new pathlines across the entire temporal resolution in 1.3 seconds using one NVIDIA Titan RTX GPU.
    Optimality in Noisy Importance Sampling. (arXiv:2201.02432v1 [stat.ML])
    (2 min) In this work, we analyze the noisy importance sampling (IS), i.e., IS working with noisy evaluations of the target density. We present the general framework and derive optimal proposal densities for noisy IS estimators. The optimal proposals incorporate the information of the variance of the noisy realizations, proposing points in regions where the noise power is higher. We also compare the use of the optimal proposals with previous optimality approaches considered in a noisy IS framework.
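    As a reminder of the setting, here is a self-normalized importance sampling estimator whose target-density evaluations are corrupted by multiplicative noise; the target, proposal, and noise model are illustrative assumptions, not the paper's experiments:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def target_pdf(x):                        # unnormalized N(2, 0.5^2) target
        return np.exp(-0.5 * ((x - 2.0) / 0.5) ** 2)

    def noisy_target_pdf(x):                  # we only see noisy evaluations
        return target_pdf(x) * rng.lognormal(mean=0.0, sigma=0.3, size=x.shape)

    n = 50_000
    x = rng.normal(0.0, 3.0, n)               # proposal: N(0, 3^2)
    q = np.exp(-0.5 * (x / 3.0) ** 2)         # unnormalized proposal density
    w = noisy_target_pdf(x) / q               # noisy importance weights
    estimate = np.sum(w * x) / np.sum(w)      # self-normalized estimate of E[X]
    print(estimate)                           # close to 2.0
    ```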
    Connecting Weighted Automata, Tensor Networks and Recurrent Neural Networks through Spectral Learning. (arXiv:2010.10029v2 [cs.LG] UPDATED)
(2 min) In this paper, we present connections between three models used in different research fields: weighted finite automata (WFA) from formal languages and linguistics, recurrent neural networks used in machine learning, and tensor networks, which encompass a set of optimization techniques for high-order tensors used in quantum physics and numerical analysis. We first present an intrinsic relation between WFA and the tensor train decomposition, a particular form of tensor network. This relation allows us to exhibit a novel low rank structure of the Hankel matrix of a function computed by a WFA and to design an efficient spectral learning algorithm leveraging this structure to scale the algorithm up to very large Hankel matrices. We then unravel a fundamental connection between WFA and second-order recurrent neural networks (2-RNN): in the case of sequences of discrete symbols, WFA and 2-RNN with linear activation functions are expressively equivalent. Leveraging this equivalence result combined with the classical spectral learning algorithm for weighted automata, we introduce the first provable learning algorithm for linear 2-RNN defined over sequences of continuous input vectors. This algorithm relies on estimating low rank sub-blocks of the Hankel tensor, from which the parameters of a linear 2-RNN can be provably recovered. The performance of the proposed learning algorithm is assessed in a simulation study on both synthetic and real-world data.
    Path classification by stochastic linear recurrent neural networks. (arXiv:2108.03090v2 [stat.ML] UPDATED)
    (2 min) We investigate the functioning of a classifying biological neural network from the perspective of statistical learning theory, modelled, in a simplified setting, as a continuous-time stochastic recurrent neural network (RNN) with identity activation function. In the purely stochastic (robust) regime, we give a generalisation error bound that holds with high probability, thus showing that the empirical risk minimiser is the best-in-class hypothesis. We show that RNNs retain a partial signature of the paths they are fed as the unique information exploited for training and classification tasks. We argue that these RNNs are easy to train and robust and back these observations with numerical experiments on both synthetic and real data. We also exhibit a trade-off phenomenon between accuracy and robustness.
    A framework for deep learning emulation of numerical models with a case study in satellite remote sensing. (arXiv:1910.13408v2 [cs.LG] UPDATED)
(3 min) Numerical models based on physics represent the state-of-the-art in earth system modeling and comprise our best tools for generating insights and predictions. Despite rapid growth in computational power, the perceived need for higher model resolutions overwhelms the latest-generation computers, reducing the ability of modelers to generate simulations for understanding parameter sensitivities and characterizing variability and uncertainty. Thus, surrogate models are often developed to capture the essential attributes of the full-blown numerical models. Recent successes of machine learning methods, especially deep learning, across many disciplines offer the possibility that complex nonlinear connectionist representations may be able to capture the underlying complex structures and nonlinear processes in earth systems. A difficult test for deep learning-based emulation, which refers to function approximation of numerical models, is to understand whether it can be comparable to traditional forms of surrogate models in terms of computational efficiency while simultaneously reproducing model results in a credible manner. A deep learning emulation that passes this test may be expected to perform even better than simple models with respect to capturing complex processes and spatiotemporal dependencies. Here we examine, with a case study in satellite-based remote sensing, the hypothesis that deep learning approaches can credibly represent the simulations from a surrogate model with comparable computational efficiency. Our results are encouraging in that the deep learning emulation reproduces the results with acceptable accuracy and often even faster performance. We discuss the broader implications of our results in light of the pace of improvements in high-performance implementations of deep learning as well as the growing desire for higher-resolution simulations in the earth sciences.
    Quantitative Performance Assessment of CNN Units via Topological Entropy Calculation. (arXiv:2103.09716v2 [cs.CV] UPDATED)
(2 min) Identifying the status of individual network units is critical for understanding the mechanism of convolutional neural networks (CNNs). However, it is still challenging to reliably give a general indication of unit status, especially for units in different network models. To this end, we propose a novel method for quantitatively clarifying the status of a single unit in a CNN using algebraic topological tools. Unit status is indicated via the calculation of a defined topology-based entropy, called feature entropy, which measures the degree of chaos of the global spatial pattern hidden in the unit for a category. In this way, feature entropy can provide an accurate indication of status for units in different networks in diverse situations, such as under a weight-rescaling operation. Further, we show that feature entropy decreases as the layer goes deeper and follows almost the same trend as the loss during training. We show that by investigating the feature entropy of units on only training data, it can discriminate between networks with different generalization ability from the view of the effectiveness of feature representations.
    QuantumNAS: Noise-Adaptive Search for Robust Quantum Circuits. (arXiv:2107.10845v5 [quant-ph] UPDATED)
    (3 min) Quantum noise is the key challenge in Noisy Intermediate-Scale Quantum (NISQ) computers. Previous work for mitigating noise has primarily focused on gate-level or pulse-level noise-adaptive compilation. However, limited research efforts have explored a higher level of optimization by making the quantum circuits themselves resilient to noise. We propose QuantumNAS, a comprehensive framework for noise-adaptive co-search of the variational circuit and qubit mapping. Variational quantum circuits are a promising approach for constructing QML and quantum simulation. However, finding the best variational circuit and its optimal parameters is challenging due to the large design space and parameter training cost. We propose to decouple the circuit search and parameter training by introducing a novel SuperCircuit. The SuperCircuit is constructed with multiple layers of pre-defined parameterized gates and trained by iteratively sampling and updating the parameter subsets (SubCircuits) of it. It provides an accurate estimation of SubCircuits performance trained from scratch. Then we perform an evolutionary co-search of SubCircuit and its qubit mapping. The SubCircuit performance is estimated with parameters inherited from SuperCircuit and simulated with real device noise models. Finally, we perform iterative gate pruning and finetuning to remove redundant gates. Extensively evaluated with 12 QML and VQE benchmarks on 14 quantum computers, QuantumNAS significantly outperforms baselines. For QML, QuantumNAS is the first to demonstrate over 95% 2-class, 85% 4-class, and 32% 10-class classification accuracy on real QC. It also achieves the lowest eigenvalue for VQE tasks on H2, H2O, LiH, CH4, BeH2 compared with UCCSD. We also open-source TorchQuantum (https://github.com/mit-han-lab/torchquantum) for fast training of parameterized quantum circuits to facilitate future research.
    Comprehensive RF Dataset Collection and Release: A Deep Learning-Based Device Fingerprinting Use Case. (arXiv:2201.02213v1 [eess.SP])
    (2 min) Deep learning-based RF fingerprinting has recently been recognized as a potential solution for enabling newly emerging wireless network applications, such as spectrum access policy enforcement, automated network device authentication, and unauthorized network access monitoring and control. Real, comprehensive RF datasets are now needed more than ever to enable the study, assessment, and validation of newly developed RF fingerprinting approaches. In this paper, we present and release a large-scale RF fingerprinting dataset, collected from 25 different LoRa-enabled IoT transmitting devices using USRP B210 receivers. Our dataset consists of a large number of SigMF-compliant binary files representing the I/Q time-domain samples and their corresponding FFT-based files of LoRa transmissions. This dataset provides a comprehensive set of essential experimental scenarios, considering both indoor and outdoor environments and various network deployments and configurations, such as the distance between the transmitters and the receiver, the configuration of the considered LoRa modulation, the physical location of the conducted experiment, and the receiver hardware used for training and testing the neural network models.
    Multi-Model Federated Learning. (arXiv:2201.02582v1 [cs.LG])
    (2 min) Federated learning is a form of distributed learning with the key challenge being the non-identically distributed nature of the data in the participating clients. In this paper, we extend federated learning to the setting where multiple unrelated models are trained simultaneously. Specifically, every client is able to train any one of M models at a time and the server maintains a model for each of the M models which is typically a suitably averaged version of the model computed by the clients. We propose multiple policies for assigning learning tasks to clients over time. In the first policy, we extend the widely studied FedAvg to multi-model learning by allotting models to clients in an i.i.d. stochastic manner. In addition, we propose two new policies for client selection in a multi-model federated setting which make decisions based on current local losses for each client-model pair. We compare the performance of the policies on tasks involving synthetic and real-world data and characterize the performance of the proposed policies. The key take-away from our work is that the proposed multi-model policies perform better or at least as good as single model training using FedAvg.
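    The first policy described is straightforward to simulate; a toy sketch with linear "models" standing in for client training (all names, dimensions, and the quadratic local objective are illustrative assumptions):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    M, CLIENTS, ROUNDS, DIM = 3, 10, 20, 5
    server = [np.zeros(DIM) for _ in range(M)]           # one global model per task
    targets = [rng.normal(size=DIM) for _ in range(M)]   # ground truth per task

    def local_update(w, m, steps=5, lr=0.1):
        """A few gradient steps toward task m's optimum, standing in for local training."""
        for _ in range(steps):
            w = w - lr * (w - targets[m])
        return w

    for _ in range(ROUNDS):
        assignment = rng.integers(0, M, size=CLIENTS)    # i.i.d. model-to-client draw
        for m in range(M):
            updates = [local_update(server[m].copy(), m)
                       for _ in np.flatnonzero(assignment == m)]
            if updates:                                  # FedAvg within each model
                server[m] = np.mean(updates, axis=0)

    print([float(np.linalg.norm(server[m] - targets[m])) for m in range(M)])
    ```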
    Neural calibration of hidden inhomogeneous Markov chains -- Information decompression in life insurance. (arXiv:2201.02397v1 [cs.LG])
(0 min) Markov chains play a key role in a vast number of areas, including life insurance mathematics. Standard actuarial quantities such as the premium value can be interpreted as compressed, lossy information about the underlying Markov process. We introduce a method to reconstruct the underlying Markov chain given collective information from a portfolio of contracts. Our neural architecture explainably characterizes the process by explicitly providing one-step transition probabilities. Further, we provide an intrinsic, economic model validation to inspect the quality of the information decompression. Lastly, our methodology is successfully tested on a realistic data set of German term life insurance contracts.
    Negative Evidence Matters in Interpretable Histology Image Classification. (arXiv:2201.02445v1 [eess.IV])
(0 min) Using only global annotations such as the image class labels, weakly-supervised learning methods allow CNN classifiers to jointly classify an image and yield the regions of interest associated with the predicted class. However, without any guidance at the pixel level, such methods may yield inaccurate regions. This problem is known to be more challenging with histology images than with natural ones, since objects are less salient, structures have more variations, and foreground and background regions have stronger similarities. Therefore, methods in the computer vision literature for visual interpretation of CNNs may not directly apply. In this work, we propose a simple yet efficient method based on a composite loss function that leverages information from the fully negative samples. Our new loss function contains two complementary terms: the first exploits positive evidence collected from the CNN classifier, while the second leverages the fully negative samples from the training dataset. In particular, we equip a pre-trained classifier with a decoder that allows refining the regions of interest. The same classifier is exploited to collect both the positive and negative evidence at the pixel level to train the decoder. This enables taking advantage of the fully negative samples that occur naturally in the data, without any additional supervision signals, using only the image class as supervision. Compared to several recent related methods, over the public benchmark GlaS for colon cancer and a Camelyon16 patch-based benchmark for breast cancer using three different backbones, we show the substantial improvements introduced by our method. Our results show the benefits of using both negative and positive evidence, i.e., the evidence obtained from a classifier and the evidence naturally available in datasets. We provide an ablation study of both terms. Our code is publicly available.
    MGAE: Masked Autoencoders for Self-Supervised Learning on Graphs. (arXiv:2201.02534v1 [cs.LG])
(2 min) We introduce a novel masked graph autoencoder (MGAE) framework to perform effective learning on graph-structured data. Taking insights from self-supervised learning, we randomly mask a large proportion of edges and try to reconstruct these missing edges during training. MGAE has two core designs. First, we find that masking a high ratio of the input graph structure, e.g., $70\%$, yields a nontrivial and meaningful self-supervisory task that benefits downstream applications. Second, we employ a graph neural network (GNN) as an encoder to perform message propagation on the partially-masked graph. To reconstruct the large number of masked edges, a tailored cross-correlation decoder is proposed. It can capture the cross-correlation between the head and tail nodes of an anchor edge at multiple granularities. Coupling these two designs enables MGAE to be trained efficiently and effectively. Extensive experiments on multiple open datasets (Planetoid and OGB benchmarks) demonstrate that MGAE generally performs better than state-of-the-art unsupervised learning competitors on link prediction and node classification.
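    The masking step itself is simple to picture; a sketch of hiding 70% of an edge list before encoding, assuming the common 2-by-E edge-index layout (the toy graph is hypothetical):

    ```python
    import numpy as np

    def mask_edges(edge_index, mask_ratio=0.7, seed=0):
        """Split edges into a visible set (fed to the GNN encoder) and a
        masked set (the reconstruction targets)."""
        rng = np.random.default_rng(seed)
        n_edges = edge_index.shape[1]
        perm = rng.permutation(n_edges)
        n_masked = int(mask_ratio * n_edges)
        masked, visible = perm[:n_masked], perm[n_masked:]
        return edge_index[:, visible], edge_index[:, masked]

    edges = np.array([[0, 0, 1, 2, 3, 3],
                      [1, 2, 2, 3, 4, 0]])   # 2 x E toy graph
    visible, masked = mask_edges(edges)
    print(visible.shape[1], "visible /", masked.shape[1], "masked")
    ```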
    Towards Lower Bounds on the Depth of ReLU Neural Networks. (arXiv:2105.14835v3 [cs.LG] UPDATED)
    (2 min) We contribute to a better understanding of the class of functions that is represented by a neural network with ReLU activations and a given architecture. Using techniques from mixed-integer optimization, polyhedral theory, and tropical geometry, we provide a mathematical counterbalance to the universal approximation theorems which suggest that a single hidden layer is sufficient for learning tasks. In particular, we investigate whether the class of exactly representable functions strictly increases by adding more layers (with no restrictions on size). This problem has potential impact on algorithmic and statistical aspects because of the insight it provides into the class of functions represented by neural hypothesis classes. However, to the best of our knowledge, this question has not been investigated in the neural network literature. We also present upper bounds on the sizes of neural networks required to represent functions in these neural hypothesis classes.
    Budget-aware Few-shot Learning via Graph Convolutional Network. (arXiv:2201.02304v1 [cs.CV])
    (2 min) This paper tackles the problem of few-shot learning, which aims to learn new visual concepts from a few examples. A common problem setting in few-shot classification assumes random sampling strategy in acquiring data labels, which is inefficient in practical applications. In this work, we introduce a new budget-aware few-shot learning problem that not only aims to learn novel object categories, but also needs to select informative examples to annotate in order to achieve data efficiency. We develop a meta-learning strategy for our budget-aware few-shot learning task, which jointly learns a novel data selection policy based on a Graph Convolutional Network (GCN) and an example-based few-shot classifier. Our selection policy computes a context-sensitive representation for each unlabeled data by graph message passing, which is then used to predict an informativeness score for sequential selection. We validate our method by extensive experiments on the mini-ImageNet, tiered-ImageNet and Omniglot datasets. The results show our few-shot learning strategy outperforms baselines by a sizable margin, which demonstrates the efficacy of our method.
    Lattice-Based Methods Surpass Sum-of-Squares in Clustering. (arXiv:2112.03898v2 [cs.LG] UPDATED)
(2 min) Clustering is a fundamental primitive in unsupervised learning which gives rise to a rich class of computationally-challenging inference tasks. In this work, we focus on the canonical task of clustering d-dimensional Gaussian mixtures with unknown (and possibly degenerate) covariance. Recent works (Ghosh et al. '20; Mao, Wein '21; Davis, Diaz, Wang '21) have established lower bounds against the class of low-degree polynomial methods and the sum-of-squares (SoS) hierarchy for recovering certain hidden structures planted in Gaussian clustering instances. Prior work on many similar inference tasks portends that such lower bounds strongly suggest the presence of an inherent statistical-to-computational gap for clustering, that is, a parameter regime where the clustering task is statistically possible but no polynomial-time algorithm succeeds. One special case of the clustering task we consider is equivalent to the problem of finding a planted hypercube vector in an otherwise random subspace. We show that, perhaps surprisingly, this particular clustering model does not exhibit a statistical-to-computational gap, even though the aforementioned low-degree and SoS lower bounds continue to apply in this case. To achieve this, we give a polynomial-time algorithm based on the Lenstra-Lenstra-Lovasz lattice basis reduction method which achieves the statistically-optimal sample complexity of d+1 samples. This result extends the class of problems whose conjectured statistical-to-computational gaps can be "closed" by "brittle" polynomial-time algorithms, highlighting the crucial but subtle role of noise in the onset of statistical-to-computational gaps.
    Binary matrix factorization on special purpose hardware. (arXiv:2010.08693v2 [cs.LG] UPDATED)
    (2 min) Many fundamental problems in data mining can be reduced to one or more NP-hard combinatorial optimization problems. Recent advances in novel technologies such as quantum and quantum-inspired hardware promise a substantial speedup for solving these problems compared to when using general purpose computers but often require the problem to be modeled in a special form, such as an Ising or quadratic unconstrained binary optimization (QUBO) model, in order to take advantage of these devices. In this work, we focus on the important binary matrix factorization (BMF) problem which has many applications in data mining. We propose two QUBO formulations for BMF. We show how clustering constraints can easily be incorporated into these formulations. The special purpose hardware we consider is limited in the number of variables it can handle which presents a challenge when factorizing large matrices. We propose a sampling based approach to overcome this challenge, allowing us to factorize large rectangular matrices. In addition to these methods, we also propose a simple baseline algorithm which outperforms our more sophisticated methods in a few situations. We run experiments on the Fujitsu Digital Annealer, a quantum-inspired complementary metal-oxide-semiconductor (CMOS) annealer, on both synthetic and real data, including gene expression data. These experiments show that our approach is able to produce more accurate BMFs than competing methods.
    Generalized quantum similarity learning. (arXiv:2201.02310v1 [quant-ph])
(2 min) The similarity between objects is significant in a broad range of areas. While similarity can be measured using off-the-shelf distance functions, they may fail to capture the inherent meaning of similarity, which tends to depend on the underlying data and task. Moreover, conventional distance functions limit the space of similarity measures to be symmetric and do not directly allow comparing objects from different spaces. We propose using quantum networks (GQSim) for learning task-dependent (a)symmetric similarity between data that need not have the same dimensionality. We analyze the properties of such similarity functions analytically (for a simple case) and numerically (for a complex case) and show that these similarity measures can extract salient features of the data. We also demonstrate that the similarity measure derived using this technique is $(\epsilon,\gamma,\tau)$-good, resulting in theoretically guaranteed performance. Finally, we conclude by applying this technique to three relevant applications: Classification, Graph Completion, and Generative modeling.
    Applications of Signature Methods to Market Anomaly Detection. (arXiv:2201.02441v1 [q-fin.CP])
    (2 min) Anomaly detection is the process of identifying abnormal instances or events in data sets which deviate from the norm significantly. In this study, we propose a signatures based machine learning algorithm to detect rare or unexpected items in a given data set of time series type. We present applications of signature or randomized signature as feature extractors for anomaly detection algorithms; additionally we provide an easy, representation theoretic justification for the construction of randomized signatures. Our first application is based on synthetic data and aims at distinguishing between real and fake trajectories of stock prices, which are indistinguishable by visual inspection. We also show a real life application by using transaction data from the cryptocurrency market. In this case, we are able to identify pump and dump attempts organized on social networks with F1 scores up to 88% by means of our unsupervised learning algorithm, thus achieving results that are close to the state-of-the-art in the field based on supervised learning.
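    For context, the (truncated) signature features used as extractors here can be computed directly from a path's increments; a depth-2 NumPy sketch of the generic construction (not the paper's code):

    ```python
    import numpy as np

    def signature_depth2(path):
        """Depth-2 signature of a piecewise-linear path of shape (T, d)."""
        inc = np.diff(path, axis=0)              # increments, shape (T-1, d)
        s1 = inc.sum(axis=0)                     # level 1: total displacement
        cum = np.cumsum(inc, axis=0) - inc       # sum of increments before step k
        s2 = cum.T @ inc + 0.5 * inc.T @ inc     # level 2: iterated integrals
        return s1, s2

    t = np.linspace(0, 1, 100)
    path = np.stack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)], axis=1)
    s1, s2 = signature_depth2(path)
    # The antisymmetric part of level 2 is (twice) the Levy area; for one
    # counterclockwise unit circle it approaches 2*pi.
    print(s2[0, 1] - s2[1, 0])   # ~ 6.28
    ```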
    Detecting Renewal States in Chains of Variable Length via Intrinsic Bayes Factors. (arXiv:2110.07430v2 [cs.LG] UPDATED)
(2 min) Markov chains with variable length are useful parsimonious stochastic models able to generate most stationary sequences of discrete symbols. The idea is to identify the suffixes of the past, called contexts, that are relevant to predict the future symbol. Sometimes a single state is a context, and looking at the past and finding this specific state makes the further past irrelevant. States with such a property are called renewal states, and they can be used to split the chain into independent and identically distributed blocks. In order to identify renewal states for chains with variable length, we propose the use of the Intrinsic Bayes Factor to evaluate the hypothesis that some particular state is a renewal state. In this case, the difficulty lies in integrating the marginal posterior distribution for the random context trees for a general prior distribution on the space of context trees, with a Dirichlet prior for the transition probabilities, and Monte Carlo methods are applied. To show the strength of our method, we analyzed artificial datasets generated from different binary models and one example coming from the field of Linguistics.
    ITSA: An Information-Theoretic Approach to Automatic Shortcut Avoidance and Domain Generalization in Stereo Matching Networks. (arXiv:2201.02263v1 [cs.CV])
(2 min) State-of-the-art stereo matching networks trained only on synthetic data often fail to generalize to more challenging real data domains. In this paper, we attempt to unfold an important factor that hinders the networks from generalizing across domains: through the lens of shortcut learning. We demonstrate that the learning of feature representations in stereo matching networks is heavily influenced by synthetic data artefacts (shortcut attributes). To mitigate this issue, we propose an Information-Theoretic Shortcut Avoidance (ITSA) approach to automatically restrict shortcut-related information from being encoded into the feature representations. As a result, our proposed method learns robust and shortcut-invariant features by minimizing the sensitivity of latent features to input variations. To avoid the prohibitive computational cost of direct input sensitivity optimization, we propose an effective yet feasible algorithm to achieve robustness. We show that using this method, state-of-the-art stereo matching networks that are trained purely on synthetic data can effectively generalize to challenging and previously unseen real data scenarios. Importantly, the proposed method enhances the robustness of the synthetic trained networks to the point that they outperform their fine-tuned counterparts (on real data) for challenging out-of-domain stereo datasets.
    GCWSNet: Generalized Consistent Weighted Sampling for Scalable and Accurate Training of Neural Networks. (arXiv:2201.02283v1 [stat.ML])
    (2 min) We develop the "generalized consistent weighted sampling" (GCWS) for hashing the "powered-GMM" (pGMM) kernel (with a tuning parameter $p$). It turns out that GCWS provides a numerically stable scheme for applying a power transformation to the original data, regardless of the magnitude of $p$ and the data. The power transformation is often effective for boosting performance, in many cases considerably so. We feed the hashed data to neural networks on a variety of public classification datasets and name our method "GCWSNet". Our extensive experiments show that GCWSNet often improves the classification accuracy. Furthermore, it is evident from the experiments that GCWSNet converges substantially faster. In fact, GCWS often reaches a reasonable accuracy with merely (less than) one epoch of training. This property is much desired because many applications, such as advertisement click-through rate (CTR) prediction models, or data streams (i.e., data seen only once), often train for just one epoch. Another beneficial side effect is that the computations of the first layer of the neural network become additions instead of multiplications because the input data become binary (and highly sparse). Empirical comparisons with (normalized) random Fourier features (NRFF) are provided. We also propose to reduce the model size of GCWSNet by count-sketch and develop the theory for analyzing the impact of count-sketch on the accuracy of GCWS. Our analysis shows that an "8-bit" strategy should work well: we can always apply an 8-bit count-sketch hashing on the output of GCWS hashing without hurting the accuracy much. There are many other ways to take advantage of GCWS when training deep neural networks. For example, one can apply GCWS on the outputs of the last layer to boost the accuracy of trained deep neural networks.
    Sparse PCA on fixed-rank matrices. (arXiv:2201.02487v1 [cs.LG])
    (2 min) Sparse PCA is the optimization problem obtained from PCA by adding a sparsity constraint on the principal components. Sparse PCA is NP-hard and hard to approximate even in the single-component case. In this paper we settle the computational complexity of sparse PCA with respect to the rank of the covariance matrix. We show that, if the rank of the covariance matrix is a fixed value, then there is an algorithm that solves sparse PCA to global optimality, whose running time is polynomial in the number of features. We also prove a similar result for the version of sparse PCA which requires the principal components to have disjoint supports.
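    For concreteness, the single-component version of the problem is commonly written as the cardinality-constrained program

    \[
    \max_{x \in \mathbb{R}^n} \; x^{\top} A x \quad \text{subject to} \quad \|x\|_2 = 1, \;\; \|x\|_0 \le k,
    \]

    where $A$ is the $n \times n$ covariance matrix and $k$ bounds the number of nonzero entries of the principal component (a textbook formulation, with notation ours; the paper's result says this becomes polynomial-time solvable once the rank of $A$ is fixed).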
    Causal Mediation Analysis with Hidden Confounders. (arXiv:2102.11724v3 [stat.ML] UPDATED)
    (2 min) An important problem in causal inference is to break down the total effect of a treatment on an outcome into different causal pathways and to quantify the causal effect in each pathway. For instance, in causal fairness, the total effect of being a male employee (i.e., treatment) constitutes its direct effect on annual income (i.e., outcome) and the indirect effect via the employee's occupation (i.e., mediator). Causal mediation analysis (CMA) is a formal statistical framework commonly used to reveal such underlying causal mechanisms. One major challenge of CMA in observational studies is handling confounders, variables that cause spurious causal relationships among treatment, mediator, and outcome. Conventional methods assume sequential ignorability, which implies that all confounders can be measured, an assumption that is often unverifiable in practice. This work aims to circumvent the stringent sequential ignorability assumption and consider hidden confounders. Drawing upon proxy strategies and recent advances in deep learning, we propose to simultaneously uncover the latent variables that characterize hidden confounders and estimate the causal effects. Empirical evaluations using both synthetic and semi-synthetic datasets validate the effectiveness of the proposed method. We further show the potential of our approach for causal fairness analysis.
    Learning to Transfer with von Neumann Conditional Divergence. (arXiv:2108.03531v2 [cs.LG] UPDATED)
    (2 min) The similarity of feature representations plays a pivotal role in the success of problems related to domain adaptation. Feature similarity includes both the invariance of marginal distributions and the closeness of conditional distributions given the desired response $y$ (e.g., class labels). Unfortunately, traditional methods often learn such features without fully taking the information in $y$ into consideration, which in turn may lead to a mismatch of the conditional distributions or a mix-up of the discriminative structures underlying the data distributions. In this work, we introduce the recently proposed von Neumann conditional divergence to improve transferability across multiple domains. We show that this new divergence is differentiable and can easily quantify the functional dependence between features and $y$. Given multiple source tasks, we integrate this divergence to capture discriminative information in $y$ and design novel learning objectives assuming those source tasks are observed either simultaneously or sequentially. In both scenarios, we obtain favorable performance against state-of-the-art methods in terms of smaller generalization error on new tasks and less catastrophic forgetting on source tasks (in the sequential setup).
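    For reference, the (unconditional) von Neumann divergence between symmetric positive-definite matrices $S_1$ and $S_2$ is commonly defined as

    \[
    D_{\mathrm{vN}}(S_1 \,\|\, S_2) = \operatorname{tr}\left( S_1 \log S_1 - S_1 \log S_2 - S_1 + S_2 \right),
    \]

    with matrix logarithms; the conditional variant used in the paper builds on this quantity to compare conditional rather than joint structure, and its exact form is given there.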
    Time Series Forecasting Using Fuzzy Cognitive Maps: A Survey. (arXiv:2201.02297v1 [cs.AI])
    (0 min) Among various soft computing approaches for time series forecasting, Fuzzy Cognitive Maps (FCMs) have shown remarkable results as a tool to model and analyze the dynamics of complex systems. FCMs have similarities to recurrent neural networks and can be classified as a neuro-fuzzy method. In other words, FCMs are a mixture of fuzzy logic, neural network, and expert system aspects, which act as a powerful tool for simulating and studying the dynamic behavior of complex systems. Their most interesting features are knowledge interpretability, dynamic characteristics, and learning capability. The goal of this survey paper is mainly to present an overview of the most relevant and recent FCM-based time series forecasting models proposed in the literature. In addition, this article provides an introduction to the fundamentals of the FCM model and its learning methodologies. The survey also offers some ideas for future research to enhance the capabilities of FCMs in order to address challenges in real-world experiments, such as handling non-stationary data and scalability issues. Moreover, equipping FCMs with fast learning algorithms is one of the major concerns in this area.
    BDFA: A Blind Data Adversarial Bit-flip Attack on Deep Neural Networks. (arXiv:2112.03477v2 [cs.CR] UPDATED)
    (2 min) An adversarial bit-flip attack (BFA) on neural network weights can cause catastrophic accuracy degradation by flipping a very small number of bits. A major drawback of prior bit-flip attack techniques is their reliance on test data, which is frequently unavailable for applications that contain sensitive or proprietary data. In this paper, we propose the Blind Data Adversarial Bit-flip Attack (BDFA), a novel technique that enables BFA without any access to training or testing data. This is achieved by optimizing a synthetic dataset, which is engineered to match the statistics of batch normalization across different layers of the network and the targeted label. Experimental results show that BDFA can decrease the accuracy of ResNet50 significantly, from 75.96% to 13.94%, with only 4 bit flips.
    Bayesian Neural Networks for Reversible Steganography. (arXiv:2201.02478v1 [cs.LG])
    (2 min) Recent advances in deep learning have led to a paradigm shift in reversible steganography. A fundamental pillar of reversible steganography is predictive modelling, which can be realised via deep neural networks. However, non-trivial errors exist in inferences about some out-of-distribution and noisy data. In view of this issue, we propose to consider uncertainty in predictive models based upon a theoretical framework of Bayesian deep learning. Bayesian neural networks can be regarded as self-aware machinery; that is, a machine that knows its own limitations. To quantify uncertainty, we approximate the posterior predictive distribution through Monte Carlo sampling with stochastic forward passes. We further show that predictive uncertainty can be disentangled into aleatoric and epistemic uncertainties, and that these quantities can be learnt in an unsupervised manner. Experimental results demonstrate that Bayesian uncertainty analysis improves steganographic capacity-distortion performance.
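    A minimal PyTorch sketch of the Monte Carlo recipe described here, assuming a predictor containing dropout layers (the paper's network and its full aleatoric/epistemic decomposition are more elaborate):

    ```python
    import torch

    def mc_predictive(model, x, n_samples=50):
        """Approximate the posterior predictive with stochastic forward passes.

        Dropout is kept active at inference (MC dropout); the sample mean is
        the prediction and the sample variance is a crude proxy for epistemic
        uncertainty.
        """
        model.train()  # keep dropout layers stochastic at inference time
        with torch.no_grad():
            preds = torch.stack([model(x) for _ in range(n_samples)])
        return preds.mean(dim=0), preds.var(dim=0)
    ```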
    Similarities and Differences between Machine Learning and Traditional Advanced Statistical Modeling in Healthcare Analytics. (arXiv:2201.02469v1 [cs.LG])
    (2 min) Data scientists and statisticians are often at odds when determining the best approach, machine learning or statistical modeling, to solve an analytics challenge. However, machine learning and statistical modeling are more cousins than adversaries on different sides of an analysis battleground. Choosing between the two approaches or in some cases using both is based on the problem to be solved and outcomes required as well as the data available for use and circumstances of the analysis. Machine learning and statistical modeling are complementary, based on similar mathematical principles, but simply using different tools in an overall analytics knowledge base. Determining the predominant approach should be based on the problem to be solved as well as empirical evidence, such as size and completeness of the data, number of variables, assumptions or lack thereof, and expected outcomes such as predictions or causality. Good analysts and data scientists should be well versed in both techniques and their proper application, thereby using the right tool for the right project to achieve the desired results.
    Principal Component Density Estimation for Scenario Generation Using Normalizing Flows. (arXiv:2104.10410v3 [cs.LG] UPDATED)
    (2 min) Neural network-based learning of the distribution of non-dispatchable renewable electricity generation from sources such as photovoltaics (PV) and wind, as well as load demands, has recently gained attention. Normalizing flow density models are particularly well suited for this task due to their training through direct log-likelihood maximization. However, research from the field of image generation has shown that standard normalizing flows can only learn smeared-out versions of manifold distributions. Previous works on normalizing flow-based scenario generation do not address this issue, and the smeared-out distributions result in the sampling of noisy time series. In this paper, we exploit the isometry of the principal component analysis (PCA), which sets up the normalizing flow in a lower-dimensional space while maintaining the direct and computationally efficient likelihood maximization. We train the resulting principal component flow (PCF) on data of PV and wind power generation as well as load demand in Germany in the years 2013 to 2015. The results of this investigation show that the PCF preserves critical features of the original distributions, such as the probability density and frequency behavior of the time series. The application of the PCF is, however, not limited to renewable power generation but rather extends to any data set, time series, or otherwise, which can be efficiently reduced using PCA.
    PAC-Bayesian Matrix Completion with a Spectral Scaled Student Prior. (arXiv:2104.08191v2 [stat.ML] UPDATED)
    (2 min) We study the problem of matrix completion in this paper. A spectral scaled Student prior is exploited to favour the underlying low-rank structure of the data matrix. We provide a thorough theoretical investigation of our approach through PAC-Bayesian bounds. More precisely, our PAC-Bayesian approach enjoys a minimax-optimal oracle inequality which guarantees that our method works well under model misspecification and under a general sampling distribution. Interestingly, we also provide efficient gradient-based sampling implementations of our approach using Langevin Monte Carlo. More specifically, we show that our algorithms are significantly faster than the Gibbs sampler on this problem. To illustrate the attractive features of our inference strategy, some numerical simulations are conducted and an application to image inpainting is demonstrated.
    Robust Image Retrieval-based Visual Localization using Kapture. (arXiv:2007.13867v3 [cs.CV] UPDATED)
    (2 min) Visual localization tackles the challenge of estimating the camera pose from images by using correspondence analysis between query images and a map. This task is computation- and data-intensive, which poses challenges for the thorough evaluation of methods on various datasets. However, in order to further advance the field, we claim that robust visual localization algorithms should be evaluated on multiple datasets covering a broad domain variety. To facilitate this, we introduce kapture, a new, flexible, unified data format and toolbox for visual localization and structure-from-motion (SFM). It enables easy usage of different datasets as well as efficient and reusable data processing. To demonstrate this, we present a versatile pipeline for visual localization that facilitates the use of different local and global features, 3D data (e.g. depth maps), non-vision sensor data (e.g. IMU, GPS, WiFi), and various processing algorithms. Using multiple configurations of the pipeline, we show the great versatility of kapture in our experiments. Furthermore, we evaluate our methods on eight public datasets, where they rank among the top methods on all of them and first on many. To foster future research, we release the code, models, and all datasets used in this paper in the kapture format, open source under a permissive BSD license. github.com/naver/kapture, github.com/naver/kapture-localization
    MIN2Net: End-to-End Multi-Task Learning for Subject-Independent Motor Imagery EEG Classification. (arXiv:2102.03814v4 [eess.SP] UPDATED)
    (2 min) Advances in motor imagery (MI)-based brain-computer interfaces (BCIs) allow control of several applications by decoding neurophysiological phenomena, which are usually recorded by electroencephalography (EEG) using a non-invasive technique. Despite great advances in MI-based BCI, EEG rhythms are specific to a subject and vary over time. These issues pose significant challenges to enhancing classification performance, especially in a subject-independent manner. To overcome these challenges, we propose MIN2Net, a novel end-to-end multi-task learning approach to tackle this task. We integrate deep metric learning into a multi-task autoencoder to learn a compact and discriminative latent representation from EEG while performing classification simultaneously. This approach reduces pre-processing complexity and results in significant performance improvement on EEG classification. Experimental results in a subject-independent manner show that MIN2Net outperforms state-of-the-art techniques, achieving F1-score improvements of 6.72% and 2.23% on the SMR-BCI and OpenBMI datasets, respectively. We demonstrate that MIN2Net improves discriminative information in the latent representation. This study indicates the possibility and practicality of using this model to develop MI-based BCI applications for new users without the need for calibration.
    Auction-Based Ex-Post-Payment Incentive Mechanism Design for Horizontal Federated Learning with Reputation and Contribution Measurement. (arXiv:2201.02410v1 [cs.AI])
    (2 min) Federated learning trains models across devices with distributed data, while protecting privacy and obtaining a model similar to that of centralized ML. A large number of workers with data and computing power are the foundation of federated learning. However, the inevitable costs prevent self-interested workers from serving for free. Moreover, due to data isolation, task publishers lack effective methods to select, evaluate, and pay reliable workers with high-quality data. Therefore, we design an auction-based incentive mechanism for horizontal federated learning with reputation and contribution measurement. By designing a reasonable method of measuring contribution, we establish a reputation for each worker, which declines easily and is difficult to improve. Through reverse auctions, workers bid for tasks, and the task publisher selects workers by combining reputation and bid price. Under the budget constraint, winning workers are paid based on performance. We prove that our mechanism satisfies the individual rationality of the honest worker, budget feasibility, truthfulness, and computational efficiency.
    Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks. (arXiv:2110.07313v3 [cs.SD] UPDATED)
    (2 min) Representation learning from unlabeled data has been of major interest in artificial intelligence research. While self-supervised speech representation learning has been popular in the speech research community, very few works have comprehensively analyzed audio representation learning for non-speech audio tasks. In this paper, we propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks. We combine the well-known wav2vec 2.0 framework, which has shown success in self-supervised learning for speech tasks, with parameter-efficient conformer architectures. Our self-supervised pre-training can reduce the need for labeled data by two-thirds. On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset through audio-only self-supervised learning. Our fine-tuned conformers also surpass or match the performance of previous systems pre-trained in a supervised way on several downstream tasks. We further discuss the important design considerations for both pre-training and fine-tuning.
    Programmatic Reward Design by Example. (arXiv:2112.08438v2 [cs.LG] UPDATED)
    (2 min) Reward design is a fundamental problem in reinforcement learning (RL). A misspecified or poorly designed reward can result in low sample efficiency and undesired behaviors. In this paper, we propose the idea of programmatic reward design, i.e. using programs to specify the reward functions in RL environments. Programs allow human engineers to express sub-goals and complex task scenarios in a structured and interpretable way. The challenge of programmatic reward design, however, is that while humans can provide the high-level structures, properly setting the low-level details, such as the right amount of reward for a specific sub-task, remains difficult. A major contribution of this paper is a probabilistic framework that can infer the best candidate programmatic reward function from expert demonstrations. Inspired by recent generative-adversarial approaches, our framework searches for the most likely programmatic reward function under which the optimally generated trajectories cannot be differentiated from the demonstrated trajectories. Experimental results show that programmatic reward functions learned using this framework can significantly outperform those learned using existing reward learning algorithms, and enable RL agents to achieve state-of-the-art performance on highly complex tasks.
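    To make the idea concrete, here is a toy programmatic reward of our own (not taken from the paper): the branching structure encodes sub-goals in an interpretable way, while the magnitudes w1 and w2 are exactly the low-level details the proposed framework would infer from demonstrations.

    ```python
    def programmatic_reward(state, w1=1.0, w2=0.1):
        """Hypothetical reward program for a key-and-door task.

        The engineer writes the structure; the weights w1 and w2 are left as
        parameters to be fitted to expert demonstrations rather than hand-tuned.
        """
        if state["has_key"] and state["at_door"]:
            return w1    # sub-goal achieved: unlock the door
        if state["has_key"]:
            return w2    # partial progress: key collected
        return -0.01     # small step penalty to encourage efficiency
    ```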
    Offline Reinforcement Learning for Road Traffic Control. (arXiv:2201.02381v1 [cs.AI])
    (2 min) Traffic signal control is an important problem in urban mobility with a significant potential for economic and environmental impact. While there is a growing interest in Reinforcement Learning (RL) for traffic control, the work so far has focussed on learning through interactions which, in practice, is costly. Instead, real experience data on traffic is available and could be exploited at minimal cost. Recent progress in offline or batch RL has enabled just that. Model-based offline RL methods, in particular, have been shown to generalize to the experience data much better than others. We build a model-based learning framework, A-DAC, which infers a Markov Decision Process (MDP) from a dataset, with pessimistic costs built in to deal with data uncertainties. The costs are modeled through an adaptive shaping of rewards in the MDP, which provides better regularization of the data compared to prior related work. A-DAC is evaluated on a complex signalized roundabout using multiple datasets varying in size and in batch collection policy. The evaluation results show that it is possible to build high-performance control policies in a data-efficient manner using simplistic batch collection policies.
    Explainable deep learning for insights in El Nino and river flows. (arXiv:2201.02596v1 [physics.ao-ph])
    (2 min) The El Nino Southern Oscillation (ENSO) is a semi-periodic fluctuation in sea surface temperature (SST) over the tropical central and eastern Pacific Ocean that influences interannual variability in regional hydrology across the world through long-range dependence or teleconnections. Recent research has demonstrated the value of Deep Learning (DL) methods for improving ENSO prediction as well as Complex Networks (CN) for understanding teleconnections. However, gaps in predictive understanding of ENSO-driven river flows include the black box nature of DL, the use of simple ENSO indices to describe a complex phenomenon and translating DL-based ENSO predictions to river flow predictions. Here we show that eXplainable DL (XDL) methods, based on saliency maps, can extract interpretable predictive information contained in global SST and discover novel SST information regions and dependence structures relevant for river flows which, in tandem with climate network constructions, enable improved predictive understanding. Our results reveal additional information content in global SST beyond ENSO indices, develop new understanding of how SSTs influence river flows, and generate improved river flow predictions with uncertainties. Observations, reanalysis data, and earth system model simulations are used to demonstrate the value of the XDL-CN based methods for future interannual and decadal scale climate projections.
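    The saliency maps mentioned above follow the standard gradient recipe; a generic PyTorch sketch (the paper's exact XDL pipeline, data handling, and network are not reproduced here):

    ```python
    import torch

    def saliency_map(model, x, output_index):
        """Gradient-based saliency: |d prediction / d input| per input cell.

        Here x could be a (1, H, W) sea-surface-temperature field and
        output_index the river-flow target of interest; large gradients mark
        the SST regions the prediction depends on.
        """
        x = x.clone().requires_grad_(True)
        model(x).reshape(-1)[output_index].backward()
        return x.grad.abs()
    ```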
    Forecasting emissions through Kaya identity using Neural Ordinary Differential Equations. (arXiv:2201.02433v1 [cs.LG])
    (2 min) Starting from the Kaya identity, we used a Neural ODE model to predict the evolution of several indicators related to carbon emissions at the country level: population, GDP per capita, energy intensity of GDP, and carbon intensity of energy. We compared the model with a baseline statistical model (VAR) and obtained good performance. We conclude that this machine-learning approach can be used to produce a wide range of results and give relevant insight to policymakers.
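    For readers unfamiliar with it, the Kaya identity decomposes CO2 emissions $F$ exactly into the four factors listed above:

    \[
    F = P \times \frac{G}{P} \times \frac{E}{G} \times \frac{F}{E},
    \]

    where $P$ is population, $G$ is GDP (so $G/P$ is GDP per capita), $E$ is energy consumption ($E/G$ is the energy intensity of GDP), and $F/E$ is the carbon intensity of energy; the Neural ODE models the joint evolution of these factors.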
    Churn prediction in online gambling. (arXiv:2201.02463v1 [cs.LG])
    (2 min) In business retention, churn prevention has always been a major concern. This work contributes to this domain by formalizing the problem of churn prediction in the context of online gambling as a binary classification task. We also propose an algorithmic answer to this problem based on recurrent neural networks. This algorithm is tested with online gambling data that have the form of time series, which can be efficiently processed by recurrent neural networks. To evaluate the performance of the trained models, standard machine learning metrics were used, such as accuracy, precision, and recall. For this problem in particular, the conducted experiments allowed us to assess that the choice of a specific architecture depends on the metric which is given the greatest importance. Architectures using nBRC favour precision, those using LSTM give better recall, while GRU-based architectures allow a higher accuracy and balance the two other metrics. Moreover, further experiments showed that using only the most recent time-series histories to train the networks decreases the quality of the results. We also study the performance of models learned at a specific instant $t$, at other times $t^{\prime} > t$. The results show that the performance of the models learned at time $t$ remains good at the following instants $t^{\prime} > t$, suggesting that there is no need to refresh the models at a high rate. However, the performance of the models was subject to noticeable variance due to one-off events impacting the data.
    Linear classifier, least-squares cost function, and outliers. (arXiv:1808.09222v2 [physics.data-an] UPDATED)
    (2 min) A set of introductory notes on the subject of data classification using a linear classifier and least-squares cost function, and the negative effect of the presence of outliers on the decision boundary of the linear discriminant. We also show how a simple scaling could make the outlier less significant, thereby obtaining a much better decision boundary. We present some numerical results.
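    A compact NumPy sketch of the setup in these notes, assuming labels encoded as +/-1; because the squared loss keeps penalizing points that are already far on the correct side, a few outliers can drag the boundary substantially:

    ```python
    import numpy as np

    def fit_least_squares_classifier(X, y):
        """Fit w minimizing ||[X, 1] w - y||^2 with labels y in {-1, +1}."""
        Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
        w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
        return w

    def predict(X, w):
        Xb = np.hstack([X, np.ones((len(X), 1))])
        return np.sign(Xb @ w)  # decision boundary: [X, 1] w = 0
    ```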
    iDECODe: In-distribution Equivariance for Conformal Out-of-distribution Detection. (arXiv:2201.02331v1 [cs.LG])
    (2 min) Machine learning methods such as deep neural networks (DNNs), despite their success across different domains, are known to often generate incorrect predictions with high confidence on inputs outside their training distribution. The deployment of DNNs in safety-critical domains requires detection of out-of-distribution (OOD) data so that DNNs can abstain from making predictions on those. A number of methods have been recently developed for OOD detection, but there is still room for improvement. We propose the new method iDECODe, leveraging in-distribution equivariance for conformal OOD detection. It relies on a novel base non-conformity measure and a new aggregation method, used in the inductive conformal anomaly detection framework, thereby guaranteeing a bounded false detection rate. We demonstrate the efficacy of iDECODe by experiments on image and audio datasets, obtaining state-of-the-art results. We also show that iDECODe can detect adversarial examples.
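    The inductive conformal wrapper behind such guarantees is simple to state; a generic sketch follows (iDECODe's specific non-conformity measure and aggregation method are its contributions and are not reproduced here):

    ```python
    import numpy as np

    def conformal_pvalue(calibration_scores, test_score):
        """p-value of a test non-conformity score against a calibration set.

        Under exchangeability, flagging inputs whose p-value falls below
        epsilon yields a false detection rate bounded by epsilon.
        """
        scores = np.asarray(calibration_scores)
        return (1 + np.sum(scores >= test_score)) / (len(scores) + 1)
    ```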
    DiffusionNet: Discretization Agnostic Learning on Surfaces. (arXiv:2012.00888v3 [cs.CV] UPDATED)
    (2 min) We introduce a new general-purpose approach to deep learning on 3D surfaces, based on the insight that a simple diffusion layer is highly effective for spatial communication. The resulting networks are automatically robust to changes in resolution and sampling of a surface -- a basic property which is crucial for practical applications. Our networks can be discretized on various geometric representations such as triangle meshes or point clouds, and can even be trained on one representation then applied to another. We optimize the spatial support of diffusion as a continuous network parameter ranging from purely local to totally global, removing the burden of manually choosing neighborhood sizes. The only other ingredients in the method are a multi-layer perceptron applied independently at each point, and spatial gradient features to support directional filters. The resulting networks are simple, robust, and efficient. Here, we focus primarily on triangle mesh surfaces, and demonstrate state-of-the-art results for a variety of tasks including surface classification, segmentation, and non-rigid correspondence.
    FedDQ: Communication-Efficient Federated Learning with Descending Quantization. (arXiv:2110.02291v3 [cs.LG] UPDATED)
    (2 min) Federated learning (FL) is an emerging privacy-preserving distributed learning scheme. Due to the large model size and frequent model aggregation, FL suffers from a critical communication bottleneck. Many techniques have been proposed to reduce the communication volume, including model compression and quantization. Existing adaptive quantization schemes use ascending-trend quantization, where the quantization level increases with the training stage. In this paper, we formulate the problem as optimizing the training convergence rate for a given communication volume. The result shows that the optimal quantization level can be represented by two factors, i.e., the training loss and the range of model updates, and that it is preferable to decrease the quantization level rather than increase it. We then propose two descending quantization schemes based on the training loss and model range. Experimental results show that the proposed schemes not only reduce the communication volume but also help FL converge faster, compared with current ascending quantization schemes.
    Visual Attention Prediction Improves Performance of Autonomous Drone Racing Agents. (arXiv:2201.02569v1 [cs.RO])
    (2 min) Humans race drones faster than neural networks trained for end-to-end autonomous flight. This may be related to the ability of human pilots to select task-relevant visual information effectively. This work investigates whether neural networks capable of imitating human eye gaze behavior and attention can improve neural network performance for the challenging task of vision-based autonomous drone racing. We hypothesize that gaze-based attention prediction can be an efficient mechanism for visual information selection and decision making in a simulator-based drone racing task. We test this hypothesis using eye gaze and flight trajectory data from 18 human drone pilots to train a visual attention prediction model. We then use this visual attention prediction model to train an end-to-end controller for vision-based autonomous drone racing using imitation learning. We compare the drone racing performance of the attention-prediction controller to controllers using raw image inputs and image-based abstractions (i.e., feature tracks). Our results show that attention-prediction based controllers outperform the baselines and are able to complete a challenging race track consistently with up to 88% success rate. Furthermore, visual attention-prediction and feature-track based models showed better generalization performance than image-based models when evaluated on hold-out reference trajectories. Our results demonstrate that human visual attention prediction improves the performance of autonomous vision-based drone racing agents and provides an essential step towards vision-based, fast, and agile autonomous flight that can eventually reach and even exceed human performance.
    Assigning function to protein-protein interactions: a weakly supervised BioBERT based approach using PubMed abstracts. (arXiv:2008.08727v3 [cs.CL] UPDATED)
    (2 min) Motivation: Protein-protein interactions (PPI) are critical to the function of proteins in both normal and diseased cells, and many critical protein functions are mediated by interactions. Knowledge of the nature of these interactions is important for the construction of networks to analyse biological data. However, only a small percentage of PPIs captured in protein interaction databases have annotations of function available, e.g. only 4% of PPIs are functionally annotated in the IntAct database. Here, we aim to label the function type of PPIs by extracting relationships described in PubMed abstracts. Method: We create a weakly supervised dataset from the IntAct PPI database containing interacting protein pairs with annotated function and associated abstracts from the PubMed database. We apply a state-of-the-art deep learning technique for biomedical natural language processing tasks, BioBERT, to build a model - dubbed PPI-BioBERT - for identifying the function of PPIs. In order to extract high-quality PPI functions at large scale, we use an ensemble of PPI-BioBERT models to improve uncertainty estimation and apply an interaction-type-specific threshold to counteract the effects of variations in the number of training samples per interaction type. Results: We scan 18 million PubMed abstracts to automatically identify 3253 new typed PPIs, including phosphorylation and acetylation interactions, with an overall precision of 46% (87% for acetylation) based on a human-reviewed sample. This work demonstrates that analysis of biomedical abstracts for PPI function extraction is a feasible approach to substantially increasing the number of interactions annotated with function captured in online databases.
    Neural Network Optimization for Reinforcement Learning Tasks Using Sparse Computations. (arXiv:2201.02571v1 [cs.LG])
    (2 min) This article proposes a sparse computation-based method for optimizing neural networks for reinforcement learning (RL) tasks. The method combines two ideas: neural network pruning and taking input data correlations into account; it makes it possible to update neuron states only when changes in them exceed a certain threshold. This significantly reduces the number of multiplications needed to run the networks. We tested different RL tasks and achieved a 20-150x reduction in the number of multiplications, with no substantial performance losses; sometimes the performance even improved.
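    A schematic NumPy sketch of the thresholded-update idea for a single linear layer (our illustration; the paper additionally combines it with pruning): only input components whose change exceeds the threshold trigger multiply-accumulate work, and the cached pre-activation is patched incrementally.

    ```python
    import numpy as np

    def delta_layer(W, x, state, threshold=0.05):
        """Update y = W @ x incrementally, skipping small input changes.

        state caches the last processed input ("x") and pre-activation ("y");
        initialize it as {"x": np.zeros(d_in), "y": np.zeros(d_out)}.
        """
        delta = x - state["x"]
        changed = np.abs(delta) > threshold
        # Sparse multiply-accumulate over the changed inputs only.
        state["y"] = state["y"] + W[:, changed] @ delta[changed]
        state["x"] = state["x"] + np.where(changed, delta, 0.0)
        return np.maximum(state["y"], 0.0)  # ReLU on the cached pre-activation
    ```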
    Introducing Self-Attention to Target Attentive Graph Neural Networks. (arXiv:2107.01516v3 [cs.IR] UPDATED)
    (2 min) Session-based recommendation systems suggest relevant items to users by modeling user behavior and preferences using short-term anonymous sessions. Existing methods leverage Graph Neural Networks (GNNs) that propagate and aggregate information from neighboring nodes i.e., local message passing. Such graph-based architectures have representational limits, as a single sub-graph is susceptible to overfit the sequential dependencies instead of accounting for complex transitions between items in different sessions. We propose a new technique that leverages a Transformer in combination with a target attentive GNN. This allows richer representations to be learnt, which translates to empirical performance gains in comparison to a vanilla target attentive GNN. Our experimental results and ablation show that our proposed method is competitive with the existing methods on real-world benchmark datasets, improving on graph-based hypotheses. Code is available at https://github.com/The-Learning-Machines/SBR
    Learning to be adversarially robust and differentially private. (arXiv:2201.02265v1 [cs.LG])
    (2 min) We study the difficulties in learning that arise from robust and differentially private optimization. We first study convergence of gradient descent based adversarial training with differential privacy, taking a simple binary classification task on linearly separable data as an illustrative example. We compare the gap between adversarial and nominal risk in both private and non-private settings, showing that the data dimensionality dependent term introduced by private optimization compounds the difficulties of learning a robust model. After this, we discuss what parts of adversarial training and differential privacy hurt optimization, identifying that the size of adversarial perturbation and clipping norm in differential privacy both increase the curvature of the loss landscape, implying poorer generalization performance.
    Unsupervised Multimodal Language Representations using Convolutional Autoencoders. (arXiv:2110.03007v2 [cs.CL] UPDATED)
    (2 min) Multimodal Language Analysis is a demanding area of research, since it is associated with two requirements: combining different modalities and capturing temporal information. During the last years, several works have been proposed in the area, mostly centered around supervised learning in downstream tasks. In this paper we propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks. Towards this end, we map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets. Extensive experimentation on Sentiment Analysis (MOSEI) and Emotion Recognition (IEMOCAP) indicates that the learned representations can achieve near-state-of-the-art performance with just the use of a Logistic Regression algorithm for downstream classification. It is also shown that our method is extremely lightweight and can be easily generalized to other tasks and unseen data with a small performance drop and almost the same number of parameters. The proposed multimodal representation models are open-sourced and will help grow the applicability of Multimodal Language analysis.
    Universal Paralinguistic Speech Representations Using Self-Supervised Conformers. (arXiv:2110.04621v2 [cs.SD] UPDATED)
    (2 min) Many speech applications require understanding aspects beyond the words being spoken, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. In this work, we introduce a new state-of-the-art paralinguistic representation derived from large-scale, fully self-supervised training of a 600M+ parameter Conformer-based architecture. We benchmark on a diverse set of speech tasks and demonstrate that simple linear classifiers trained on top of our time-averaged representation outperform nearly all previous results, in some cases by large margins. Our analyses of context-window size demonstrate that, surprisingly, 2-second context windows achieve 96% of the performance of the Conformers that use the full long-term context on 7 out of 9 tasks. Furthermore, while the best per-task representations are extracted internally in the network, stable performance across several layers allows a single universal representation to reach near-optimal performance on all tasks.
    Mirror Learning: A Unifying Framework of Policy Optimisation. (arXiv:2201.02373v1 [cs.LG])
    (2 min) General policy improvement (GPI) and trust-region learning (TRL) are the predominant frameworks within contemporary reinforcement learning (RL), which serve as the core models for solving Markov decision processes (MDPs). Unfortunately, in their mathematical form, they are sensitive to modifications, and thus, the practical instantiations that implement them do not automatically inherit their improvement guarantees. As a result, the spectrum of available rigorous MDP-solvers is narrow. Indeed, many state-of-the-art (SOTA) algorithms, such as TRPO and PPO, are not proven to converge. In this paper, we propose \textsl{mirror learning} -- a general solution to the RL problem. We reveal GPI and TRL to be but small points within this far greater space of algorithms which boasts the monotonic improvement property and converges to the optimal policy. We show that virtually all SOTA algorithms for RL are instances of mirror learning, and thus suggest that their empirical performance is a consequence of their theoretical properties, rather than of approximate analogies. Excitingly, we show that mirror learning opens up a whole new space of policy learning methods with convergence guarantees.
    Bayesian Online Change Point Detection for Baseline Shifts. (arXiv:2201.02325v1 [stat.ML])
    (2 min) In time series data analysis, detecting change points on a real-time basis (online) is of great interest in many areas, such as finance, environmental monitoring, and medicine. One promising means to achieve this is the Bayesian online change point detection (BOCPD) algorithm, which has been successfully adopted in particular cases in which the time series of interest has a fixed baseline. However, we have found that the algorithm struggles when the baseline irreversibly shifts from its initial state. This is because with the original BOCPD algorithm, the sensitivity with which a change point can be detected is degraded if the data points are fluctuating at locations relatively far from the original baseline. In this paper, we not only extend the original BOCPD algorithm to be applicable to a time series whose baseline is constantly shifting toward unknown values but also visualize why the proposed extension works. To demonstrate the efficacy of the proposed algorithm compared to the original one, we examine these algorithms on two real-world data sets and six synthetic data sets.
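    For context, BOCPD tracks the posterior over the run length $r_t$ (the time since the last change point) via the standard recursion

    \[
    P(r_t, x_{1:t}) = \sum_{r_{t-1}} P(r_t \mid r_{t-1}) \, P(x_t \mid r_{t-1}, x_t^{(r)}) \, P(r_{t-1}, x_{1:t-1}),
    \]

    where $x_t^{(r)}$ denotes the observations within the current run; the extension proposed here targets the case where the baseline underlying the predictive term $P(x_t \mid \cdot)$ keeps shifting toward unknown values.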
    GenLabel: Mixup Relabeling using Generative Models. (arXiv:2201.02354v1 [cs.LG])
    (2 min) Mixup is a data augmentation method that generates new data points by mixing a pair of input data. While mixup generally improves the prediction performance, it sometimes degrades the performance. In this paper, we first identify the main causes of this phenomenon by theoretically and empirically analyzing the mixup algorithm. To resolve this, we propose GenLabel, a simple yet effective relabeling algorithm designed for mixup. In particular, GenLabel helps the mixup algorithm correctly label mixup samples by learning the class-conditional data distribution using generative models. Via extensive theoretical and empirical analysis, we show that mixup, when used together with GenLabel, can effectively resolve the aforementioned phenomenon, improving the generalization performance and the adversarial robustness.
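    For reference, vanilla mixup is just a convex combination of two training examples and their one-hot labels; the failure mode GenLabel addresses is that the mixed label below can be wrong when the mixed point lands in a third class's region:

    ```python
    import numpy as np

    def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
        """Vanilla mixup: draw lam ~ Beta(alpha, alpha) and convex-combine."""
        rng = rng or np.random.default_rng()
        lam = rng.beta(alpha, alpha)
        x_mix = lam * x1 + (1 - lam) * x2
        y_mix = lam * y1 + (1 - lam) * y2  # the soft label GenLabel would re-examine
        return x_mix, y_mix
    ```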
    3D Intracranial Aneurysm Classification and Segmentation via Unsupervised Dual-branch Learning. (arXiv:2201.02198v1 [eess.IV])
    (2 min) Intracranial aneurysms are common nowadays, and how to detect them intelligently is of great significance in digital health. While most existing deep learning research has focused on medical images in a supervised way, we introduce an unsupervised method for the detection of intracranial aneurysms based on 3D point cloud data. In particular, our method consists of two stages: unsupervised pre-training and downstream tasks. As for the former, the main idea is to pair each point cloud with its jittered counterpart and maximise their correspondence. We then design a dual-branch contrastive network with an encoder for each branch and a subsequent common projection head. As for the latter, we design simple networks for supervised classification and segmentation training. Experiments on the public dataset (IntrA) show that our unsupervised method achieves comparable or even better performance than some state-of-the-art supervised techniques, and it is most prominent in the detection of aneurysmal vessels. Experiments on ModelNet40 also show that our method achieves an accuracy of 90.79%, which outperforms existing state-of-the-art unsupervised models.
    k-Center Clustering with Outliers in Sliding Windows. (arXiv:2201.02448v1 [cs.LG])
    (2 min) Metric $k$-center clustering is a fundamental unsupervised learning primitive. Although widely used, this primitive is heavily affected by noise in the data, so that a more sensible variant seeks for the best solution that disregards a given number $z$ of points of the dataset, called outliers. We provide efficient algorithms for this important variant in the streaming model under the sliding window setting, where, at each time step, the dataset to be clustered is the window $W$ of the most recent data items. Our algorithms achieve $O(1)$ approximation and, remarkably, require a working memory linear in $k+z$ and only logarithmic in $|W|$. As a by-product, we show how to estimate the effective diameter of the window $W$, which is a measure of the spread of the window points, disregarding a given fraction of noisy distances. We also provide experimental evidence of the practical viability of our theoretical results.
    Audiomer: A Convolutional Transformer For Keyword Spotting. (arXiv:2109.10252v3 [cs.LG] UPDATED)
    (2 min) Transformers have seen an unprecedented rise in Natural Language Processing and Computer Vision tasks. However, in audio tasks, they are either infeasible to train due to extremely large sequence length of audio waveforms or incur a performance penalty when trained on Fourier-based features. In this work, we introduce an architecture, Audiomer, where we combine 1D Residual Networks with Performer Attention to achieve state-of-the-art performance in keyword spotting with raw audio waveforms, outperforming all previous methods while being computationally cheaper and parameter-efficient. Additionally, our model has practical advantages for speech processing, such as inference on arbitrarily long audio clips owing to the absence of positional encoding. The code is available at https://github.com/The-Learning-Machines/Audiomer-PyTorch.
    Spatial-Temporal Sequential Hypergraph Network for Crime Prediction. (arXiv:2201.02435v1 [cs.LG])
    (2 min) Crime prediction is crucial for public safety and resource optimization, yet it is very challenging due to two aspects: i) the dynamics of criminal patterns across time and space, as crime events are distributed unevenly in both the spatial and temporal domains; ii) time-evolving dependencies between different types of crimes (e.g., Theft, Robbery, Assault, Damage), which reveal fine-grained semantics of crimes. To tackle these challenges, we propose the Spatial-Temporal Sequential Hypergraph Network (ST-SHN) to collectively encode complex crime spatial-temporal patterns as well as the underlying category-wise crime semantic relationships. Specifically, to handle spatial-temporal dynamics under the long-range and global context, we design a graph-structured message passing architecture with the integration of the hypergraph learning paradigm. To capture category-wise crime heterogeneous relations in a dynamic environment, we introduce a multi-channel routing mechanism to learn the time-evolving structural dependency across crime types. We conduct extensive experiments on two real-world datasets, showing that our proposed ST-SHN framework can significantly improve the prediction performance compared to various state-of-the-art baselines. The source code is available at: https://github.com/akaxlh/ST-SHN.
    Exploring Adversarial Robustness of Multi-Sensor Perception Systems in Self Driving. (arXiv:2101.06784v3 [cs.CV] UPDATED)
    (2 min) Modern self-driving perception systems have been shown to improve upon processing complementary inputs such as LiDAR with images. In isolation, 2D images have been found to be extremely vulnerable to adversarial attacks. Yet, there have been limited studies on the adversarial robustness of multi-modal models that fuse LiDAR features with image features. Furthermore, existing works do not consider physically realizable perturbations that are consistent across the input modalities. In this paper, we showcase practical susceptibilities of multi-sensor detection by placing an adversarial object on top of a host vehicle. We focus on physically realizable and input-agnostic attacks as they are feasible to execute in practice, and show that a single universal adversary can hide different host vehicles from state-of-the-art multi-modal detectors. Our experiments demonstrate that successful attacks are primarily caused by easily corrupted image features. Furthermore, we find that in modern sensor fusion methods which project image features into 3D, adversarial attacks can exploit the projection process to generate false positives across distant regions in 3D. Towards more robust multi-modal perception systems, we show that adversarial training with feature denoising can boost robustness to such attacks significantly. However, we find that standard adversarial defenses still struggle to prevent false positives which are also caused by inaccurate associations between 3D LiDAR points and 2D pixels.
    Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT. (arXiv:2201.02229v1 [cs.LG])
    (2 min) Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. However, only 4% of PPIs are annotated with PTMs in biological knowledge databases such as IntAct, mainly through manual curation, which is neither time- nor cost-effective. We use the IntAct PPI database to create a distant supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models - dubbed PPI-BioBERT-x10 - to improve confidence calibration. We extend the ensemble average confidence approach with confidence variation to counteract the effects of class imbalance and extract high-confidence predictions. The PPI-BioBERT-x10 model evaluated on the test set resulted in a modest F1-micro of 41.3 (P = 58.1, R = 32.1). However, by combining high confidence and low variation to identify high-quality predictions, tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts, extracted 1.6 million (546507 unique PTM-PPI triplets) PTM-PPI predictions, and filtered ~5700 (4584 unique) high-confidence predictions. Of the 5700, human evaluation on a small randomly sampled subset shows that the precision drops to 33.7% despite confidence calibration, highlighting the challenges of generalisability beyond the test set even with confidence calibration. We circumvent the problem by only including predictions associated with multiple papers, improving the precision to 58.8%. In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts.
    Contrastive Code Representation Learning. (arXiv:2007.04973v4 [cs.LG] UPDATED)
    (2 min) Recent work learns contextual representations of source code by reconstructing tokens from their context. For downstream semantic understanding tasks like summarizing code in English, these representations should ideally capture program functionality. However, we show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics. We propose ContraCode: a contrastive pre-training task that learns code functionality, not form. ContraCode pre-trains a neural network to identify functionally similar variants of a program among many non-equivalent distractors. We scalably generate these variants using an automated source-to-source compiler as a form of data augmentation. Contrastive pre-training improves JavaScript summarization and TypeScript type inference accuracy by 2% to 13%. We also propose a new zero-shot JavaScript code clone detection dataset, showing that ContraCode is both more robust and semantically meaningful. On it, we outperform RoBERTa by 39% AUROC in an adversarial setting and up to 5% on natural code.
    Stochastic Saddle Point Problems with Decision-Dependent Distributions. (arXiv:2201.02313v1 [math.OC])
    (2 min) This paper focuses on stochastic saddle point problems with decision-dependent distributions in both the static and time-varying settings. These are problems whose objective is the expected value of a stochastic payoff function, where random variables are drawn from a distribution induced by a distributional map. For general distributional maps, the problem of finding saddle points is in general computationally burdensome, even if the distribution is known. To enable a tractable solution approach, we introduce the notion of equilibrium points -- which are saddle points for the stationary stochastic minimax problem that they induce -- and provide conditions for their existence and uniqueness. We demonstrate that the distance between the two classes of solutions is bounded provided that the objective has a strongly-convex-strongly-concave payoff and Lipschitz continuous distributional map. We develop deterministic and stochastic primal-dual algorithms and demonstrate their convergence to the equilibrium point. In particular, by modeling errors emerging from a stochastic gradient estimator as sub-Weibull random variables, we provide error bounds in expectation and in high probability that hold for each iteration; moreover, we show convergence to a neighborhood in expectation and almost surely. Finally, we investigate a condition on the distributional map -- which we call opposing mixture dominance -- that ensures the objective is strongly-convex-strongly-concave. Under this assumption, we show that primal-dual algorithms converge to the saddle points in a similar fashion.
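    In symbols (notation ours), the problem class studied here is

    \[
    \min_{x \in X} \, \max_{y \in Y} \; \mathbb{E}_{z \sim \mathcal{D}(x, y)} \big[ \phi(x, y, z) \big],
    \]

    where $\mathcal{D}$ is the distributional map: the decision pair $(x, y)$ itself shifts the distribution of the randomness $z$, which is what makes direct saddle-point computation burdensome and motivates the equilibrium-point notion.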
    A Theoretical Framework of Almost Hyperparameter-free Hyperparameter Selection Methods for Offline Policy Evaluation. (arXiv:2201.02300v1 [stat.ML])
    (2 min) We are concerned with the problem of hyperparameter selection for offline policy evaluation (OPE). OPE is a key component of offline reinforcement learning, which is a core technology for data-driven decision optimization without environment simulators. However, the current state-of-the-art OPE methods are not hyperparameter-free, which undermines their utility in real-life applications. We address this issue by introducing a new approximate hyperparameter selection (AHS) framework for OPE, which defines a notion of optimality (called selection criteria) in a quantitative and interpretable manner without hyperparameters. We then derive four AHS methods, each of which has different characteristics such as convergence rate and time complexity. Finally, we verify the effectiveness and limitations of these methods with a preliminary experiment.
    Nonlocal Kernel Network (NKN): a Stable and Resolution-Independent Deep Neural Network. (arXiv:2201.02217v1 [cs.LG])
    (2 min) Neural operators have recently become popular tools for designing solution maps between function spaces in the form of neural networks. Differently from classical scientific machine learning approaches that learn parameters of a known partial differential equation (PDE) for a single instance of the input parameters at a fixed resolution, neural operators approximate the solution map of a family of PDEs. Despite their success, the uses of neural operators are so far restricted to relatively shallow neural networks and confined to learning hidden governing laws. In this work, we propose a novel nonlocal neural operator, which we refer to as nonlocal kernel network (NKN), that is resolution independent, characterized by deep neural networks, and capable of handling a variety of tasks such as learning governing equations and classifying images. Our NKN stems from the interpretation of the neural network as a discrete nonlocal diffusion reaction equation that, in the limit of infinite layers, is equivalent to a parabolic nonlocal equation, whose stability is analyzed via nonlocal vector calculus. The resemblance with integral forms of neural operators allows NKNs to capture long-range dependencies in the feature space, while the continuous treatment of node-to-node interactions makes NKNs resolution independent. The resemblance with neural ODEs, reinterpreted in a nonlocal sense, and the stable network dynamics between layers allow for generalization of NKN's optimal parameters from shallow to deep networks. This fact enables the use of shallow-to-deep initialization techniques. Our tests show that NKNs outperform baseline methods in both learning governing equations and image classification tasks and generalize well to different resolutions and depths.
    PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences. (arXiv:2201.02273v1 [q-bio.GN])
    (3 min) The origin of the COVID-19 pandemic is still unknown and is an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona-)viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts which can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating, and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is an important part of determining host specificity, since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the hosts of over five thousand coronaviruses from their spike protein sequences, segregating them into clusters of distinct hosts among avians, bats, camels, swines, humans, and weasels, to name a few. We propose a feature embedding based on the well-known position weight matrix (PWM), which we call PWM2Vec, and use it to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications, such as determining protein function or identifying transcription-factor binding sites, we are the first (to the best of our knowledge) to use PWMs to generate a fixed-length feature vector representation in the context of host classification from viral sequences. The results on real-world data show that using PWM2Vec we are able to perform comparably to baseline models. We also measure the importance of different amino acids using information gain to show which amino acids are important for predicting the host of a given coronavirus.
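    A minimal NumPy sketch of a position weight matrix in the spirit of PWM2Vec (the paper's exact construction, e.g., its handling of k-mers, may differ); flattening the matrix yields a fixed-length feature vector suitable for downstream classifiers:

    ```python
    import numpy as np

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def position_weight_matrix(seqs, pseudocount=1.0):
        """Log-odds PWM (against a uniform background) for equal-length sequences."""
        index = {a: i for i, a in enumerate(AMINO_ACIDS)}
        counts = np.full((len(AMINO_ACIDS), len(seqs[0])), pseudocount)
        for seq in seqs:
            for pos, aa in enumerate(seq):
                counts[index[aa], pos] += 1
        probs = counts / counts.sum(axis=0, keepdims=True)
        return np.log2(probs * len(AMINO_ACIDS))

    # Hypothetical usage: embed three aligned spike-protein fragments.
    embedding = position_weight_matrix(["MKV", "MRV", "MKV"]).ravel()
    ```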
    Stereoscopic Universal Perturbations across Different Architectures and Datasets. (arXiv:2112.06116v2 [cs.CV] UPDATED)
    (2 min) We study the effect of adversarial perturbations of images on deep stereo matching networks for the disparity estimation task. We present a method to craft a single set of perturbations that, when added to any stereo image pair in a dataset, can fool a stereo network to significantly alter the perceived scene geometry. Our perturbation images are "universal" in that they not only corrupt estimates of the network on the dataset they are optimized for, but also generalize to stereo networks with different architectures across different datasets. We evaluate our approach on multiple public benchmark datasets and show that our perturbations can increase D1-error (akin to fooling rate) of state-of-the-art stereo networks from 1% to as much as 87%. We investigate the effect of perturbations on the estimated scene geometry and identify object classes that are most vulnerable. Our analysis on the activations of registered points between left and right images led us to find that certain architectural components, i.e. deformable convolution and explicit matching, can increase robustness against adversaries. We demonstrate that by simply designing networks with such components, one can reduce the effect of adversaries by up to 60.5%, which rivals the robustness of networks fine-tuned with costly adversarial data augmentation.
    Leveraging Scale-Invariance and Uncertainty with Self-Supervised Domain Adaptation for Semantic Segmentation of Foggy Scenes. (arXiv:2201.02588v1 [cs.CV])
    (2 min) This paper presents FogAdapt, a novel approach for domain adaptation of semantic segmentation for dense foggy scenes. Although significant research has been directed to reduce the domain shift in semantic segmentation, adaptation to scenes with adverse weather conditions remains an open question. Large variations in the visibility of the scene due to weather conditions, such as fog, smog, and haze, exacerbate the domain shift, thus making unsupervised adaptation in such scenarios challenging. We propose a self-entropy and multi-scale information augmented self-supervised domain adaptation method (FogAdapt) to minimize the domain shift in foggy scenes segmentation. Supported by the empirical evidence that an increase in fog density results in high self-entropy for segmentation probabilities, we introduce a self-entropy based loss function to guide the adaptation method. Furthermore, inferences obtained at different image scales are combined and weighted by the uncertainty to generate scale-invariant pseudo-labels for the target domain. These scale-invariant pseudo-labels are robust to visibility and scale variations. We evaluate the proposed model on real clear-weather scenes to real foggy scenes adaptation and synthetic non-foggy images to real foggy scenes adaptation scenarios. Our experiments demonstrate that FogAdapt significantly outperforms the current state-of-the-art in semantic segmentation of foggy images. Specifically, by considering the standard settings compared to state-of-the-art (SOTA) methods, FogAdapt gains 3.8% on Foggy Zurich, 6.0% on Foggy Driving-dense, and 3.6% on Foggy Driving in mIoU when adapted from Cityscapes to Foggy Zurich.
    Audio representations for deep learning in sound synthesis: A review. (arXiv:2201.02490v1 [cs.SD])
    (2 min) The rise of deep learning algorithms has led many researchers to move away from classic signal processing methods for sound generation. Deep learning models have achieved expressive voice synthesis, realistic sound textures, and musical notes from virtual instruments. However, the most suitable deep learning architecture is still under investigation. The choice of architecture is tightly coupled to the audio representation. A sound's original waveform can be too dense and rich for deep learning models to deal with efficiently, and complexity increases training time and computational cost. Also, it does not represent sound in the manner in which it is perceived. Therefore, in many cases, the raw audio is transformed into a compressed and more meaningful form using upsampling, feature extraction, or a higher-level representation of the waveform. Furthermore, depending on the form chosen, additional conditioning representations, different model architectures, and numerous metrics for evaluating the reconstructed sound have been investigated. This paper provides an overview of audio representations applied to sound synthesis using deep learning. Additionally, it presents the most significant methods for developing and evaluating a sound synthesis architecture using deep learning models, always depending on the audio representation.
    Learning Multi-Tasks with Inconsistent Labels by using Auxiliary Big Task. (arXiv:2201.02305v1 [cs.LG])
    (2 min) Multi-task learning aims to improve the performance of a model by transferring and exploiting common knowledge among tasks. Existing MTL works mainly focus on the scenario where label sets among the multiple tasks (MTs) are the same, so that they can be utilized for learning across the tasks. Few works explore the scenario where each task has only a small number of training samples and the label sets are only partially overlapping, or not overlapping at all. Learning such MTs is more challenging because less correlation information is available among these tasks. For this, we propose a framework to learn these tasks by jointly leveraging both abundant information from a learnt auxiliary big task, with sufficiently many classes to cover those of all these tasks, and the information shared among those partially-overlapped tasks. In our implementation, which uses the same neural network architecture as the learnt auxiliary task to learn the individual tasks, the key idea is to utilize the available label information to adaptively prune the hidden-layer neurons of the auxiliary network, constructing a corresponding network for each task while accompanying it with joint learning across the individual tasks. Our experimental results demonstrate its effectiveness in comparison with the state-of-the-art approaches.
    Inferring Turbulent Parameters via Machine Learning. (arXiv:2201.00732v1 [physics.flu-dyn] CROSS LISTED)
    (2 min) We design a machine learning technique to solve the general problem of inferring physical parameters from the observation of turbulent flows, a relevant exercise in many theoretical and applied fields, from engineering to earth observation and astrophysics. Our approach is to train the machine learning system to regress the rotation frequency of the flow's reference frame from the observation of the flow's velocity amplitude on a 2d plane extracted from the 3d domain. The machine learning approach consists of a Deep Convolutional Neural Network (DCNN) of the same kind developed in computer vision. The training and validation datasets are produced by means of fully resolved direct numerical simulations. This study shows interesting results from two different points of view. From the machine learning point of view, it shows the potential of DCNNs, which reach good results on a particularly complex problem that goes well outside the limits of human vision. Second, from the physics point of view, it provides an example of how machine learning can be exploited in data analysis to infer information that would otherwise be inaccessible. Indeed, by comparing DCNN with the other possible Bayesian approaches, we find that DCNN yields much higher inference accuracy in all the examined cases.
    Generalized Category Discovery. (arXiv:2201.02609v1 [cs.CV])
    (2 min) In this paper, we consider a highly general image recognition setting wherein, given a labelled and unlabelled set of images, the task is to categorize all images in the unlabelled set. Here, the unlabelled images may come from labelled classes or from novel ones. Existing recognition methods are not able to deal with this setting, because they make several restrictive assumptions, such as the unlabelled instances only coming from known - or unknown - classes and the number of unknown classes being known a-priori. We address the more unconstrained setting, naming it 'Generalized Category Discovery', and challenge all these assumptions. We first establish strong baselines by taking state-of-the-art algorithms from novel category discovery and adapting them for this task. Next, we propose the use of vision transformers with contrastive representation learning for this open world setting. We then introduce a simple yet effective semi-supervised $k$-means method to cluster the unlabelled data into seen and unseen classes automatically, substantially outperforming the baselines. Finally, we also propose a new approach to estimate the number of classes in the unlabelled data. We thoroughly evaluate our approach on public datasets for generic object classification including CIFAR10, CIFAR100 and ImageNet-100, and for fine-grained visual recognition including CUB, Stanford Cars and Herbarium19, benchmarking on this new setting to foster future research.
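    The semi-supervised $k$-means step lends itself to a compact illustration: labelled points keep their known assignments while only unlabelled points are re-clustered, so clusters that attract no labelled points become candidate novel classes. Below is a minimal sketch of that idea (not the authors' code; names and initialization choices are hypothetical):

    ```python
    # Sketch of semi-supervised k-means: labelled points are clamped to their
    # classes; only unlabelled points are re-assigned each iteration.
    import numpy as np

    def ss_kmeans(X_lab, y_lab, X_unlab, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        X = np.vstack([X_lab, X_unlab])
        n_lab = len(X_lab)
        # One centroid per labelled class; extras start at random unlabelled points
        cents = [X_lab[y_lab == c].mean(axis=0) for c in range(y_lab.max() + 1)]
        while len(cents) < k:
            cents.append(X_unlab[rng.integers(len(X_unlab))])
        C = np.array(cents)
        for _ in range(iters):
            dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
            assign = dists.argmin(axis=1)
            assign[:n_lab] = y_lab  # clamp labelled points to their known classes
            for j in range(k):
                members = X[assign == j]
                if len(members):
                    C[j] = members.mean(axis=0)
        return assign, C
    ```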
    On robust risk-based active-learning algorithms for enhanced decision support. (arXiv:2201.02555v1 [cs.LG])
    (2 min) Classification models are a fundamental component of physical-asset management technologies such as structural health monitoring (SHM) systems and digital twins. Previous work introduced risk-based active learning, an online approach for the development of statistical classifiers that takes into account the decision-support context in which they are applied. Decision-making is considered by preferentially querying data labels according to expected value of perfect information (EVPI). Although several benefits are gained by adopting a risk-based active learning approach, including improved decision-making performance, the algorithms suffer from issues relating to sampling bias as a result of the guided querying process. This sampling bias ultimately manifests as a decline in decision-making performance during the later stages of active learning, which in turn corresponds to lost resource/utility. The current paper proposes two novel approaches to counteract the effects of sampling bias: semi-supervised learning and discriminative classification models. These approaches are first visualised using a synthetic dataset, then subsequently applied to an experimental case study, specifically, the Z24 Bridge dataset. The semi-supervised learning approach is shown to have variable performance, with robustness to sampling bias depending on the suitability of the generative distributions selected for the model with respect to each dataset. In contrast, the discriminative classifiers are shown to have excellent robustness to the effects of sampling bias. Moreover, it was found that the number of inspections made during a monitoring campaign, and therefore resource expenditure, could be reduced with the careful selection of the statistical classifiers used within a decision-supporting monitoring system.
  • stat.ML updates on arXiv.org

    Learning to Transfer with von Neumann Conditional Divergence. (arXiv:2108.03531v2 [cs.LG] UPDATED)
    (0 min) The similarity of feature representations plays a pivotal role in the success of problems related to domain adaptation. Feature similarity includes both the invariance of marginal distributions and the closeness of conditional distributions given the desired response $y$ (e.g., class labels). Unfortunately, traditional methods always learn such features without fully taking the information in $y$ into consideration, which in turn may lead to a mismatch of the conditional distributions or the mix-up of discriminative structures underlying data distributions. In this work, we introduce the recently proposed von Neumann conditional divergence to improve the transferability across multiple domains. We show that this new divergence is differentiable and can easily quantify the functional dependence between features and $y$. Given multiple source tasks, we integrate this divergence to capture discriminative information in $y$ and design novel learning objectives assuming those source tasks are observed either simultaneously or sequentially. In both scenarios, we obtain favorable performance against state-of-the-art methods in terms of smaller generalization error on new tasks and less catastrophic forgetting on source tasks (in the sequential setup).
    Unified Field Theory for Deep and Recurrent Neural Networks. (arXiv:2112.05589v2 [cond-mat.dis-nn] UPDATED)
    (2 min) Understanding capabilities and limitations of different network architectures is of fundamental importance to machine learning. Bayesian inference on Gaussian processes has proven to be a viable approach for studying recurrent and deep networks in the limit of infinite layer width, $n\to\infty$. Here we present a unified and systematic derivation of the mean-field theory for both architectures that starts from first principles by employing established methods from statistical physics of disordered systems. The theory elucidates that while the mean-field equations are different with regard to their temporal structure, they nevertheless yield identical Gaussian kernels when readouts are taken at a single time point or layer, respectively. Bayesian inference applied to classification then predicts identical performance and capabilities for the two architectures. Numerically, we find that convergence towards the mean-field theory is typically slower for recurrent networks than for deep networks and the convergence speed depends non-trivially on the parameters of the weight prior as well as the depth or number of time steps, respectively. Our method exposes that Gaussian processes are but the lowest order of a systematic expansion in $1/n$. The formalism thus paves the way to investigate the fundamental differences between recurrent and deep architectures at finite widths $n$.
    Towards Lower Bounds on the Depth of ReLU Neural Networks. (arXiv:2105.14835v3 [cs.LG] UPDATED)
    (2 min) We contribute to a better understanding of the class of functions that is represented by a neural network with ReLU activations and a given architecture. Using techniques from mixed-integer optimization, polyhedral theory, and tropical geometry, we provide a mathematical counterbalance to the universal approximation theorems which suggest that a single hidden layer is sufficient for learning tasks. In particular, we investigate whether the class of exactly representable functions strictly increases by adding more layers (with no restrictions on size). This problem has potential impact on algorithmic and statistical aspects because of the insight it provides into the class of functions represented by neural hypothesis classes. However, to the best of our knowledge, this question has not been investigated in the neural network literature. We also present upper bounds on the sizes of neural networks required to represent functions in these neural hypothesis classes.
    Applications of Signature Methods to Market Anomaly Detection. (arXiv:2201.02441v1 [q-fin.CP])
    (2 min) Anomaly detection is the process of identifying abnormal instances or events in data sets which deviate from the norm significantly. In this study, we propose a signature-based machine learning algorithm to detect rare or unexpected items in a given data set of time-series type. We present applications of signatures or randomized signatures as feature extractors for anomaly detection algorithms; additionally we provide an easy, representation-theoretic justification for the construction of randomized signatures. Our first application is based on synthetic data and aims at distinguishing between real and fake trajectories of stock prices, which are indistinguishable by visual inspection. We also show a real life application by using transaction data from the cryptocurrency market. In this case, we are able to identify pump and dump attempts organized on social networks with F1 scores up to 88% by means of our unsupervised learning algorithm, thus achieving results that are close to the state-of-the-art in the field based on supervised learning.
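    The randomized-signature construction mentioned above admits a very short reservoir-style sketch: one fixed random vector field per input channel drives a hidden state with the path increments. The following numpy illustration is a generic version of the technique, not the authors' implementation (state dimension, scaling, and nonlinearity are arbitrary choices):

    ```python
    # Randomized-signature feature map (generic sketch): the state Z is driven
    # by the path increments through fixed random maps, one per channel.
    import numpy as np

    def randomized_signature(path, dim=64, seed=0):
        """path: (T, d) array of a d-dimensional time series -> (dim,) features."""
        rng = np.random.default_rng(seed)
        T, d = path.shape
        A = rng.normal(scale=1.0 / np.sqrt(dim), size=(d, dim, dim))
        b = rng.normal(size=(d, dim))
        Z = np.zeros(dim)
        dX = np.diff(path, axis=0)  # increments of the path
        for t in range(T - 1):
            for i in range(d):
                Z = Z + np.tanh(A[i] @ Z + b[i]) * dX[t, i]
        return Z
    ```
    The resulting fixed-length vectors can then be fed to any off-the-shelf anomaly detector.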
    A Theoretical Framework of Almost Hyperparameter-free Hyperparameter Selection Methods for Offline Policy Evaluation. (arXiv:2201.02300v1 [stat.ML])
    (2 min) We are concerned with the problem of hyperparameter selection for offline policy evaluation (OPE). OPE is a key component of offline reinforcement learning, which is a core technology for data-driven decision optimization without environment simulators. However, the current state-of-the-art OPE methods are not hyperparameter-free, which undermines their utility in real-life applications. We address this issue by introducing a new approximate hyperparameter selection (AHS) framework for OPE, which defines a notion of optimality (called selection criteria) in a quantitative and interpretable manner without hyperparameters. We then derive four AHS methods, each of which has different characteristics such as convergence rate and time complexity. Finally, we verify the effectiveness and limitations of these methods with a preliminary experiment.
    Lattice-Based Methods Surpass Sum-of-Squares in Clustering. (arXiv:2112.03898v2 [cs.LG] UPDATED)
    (2 min) Clustering is a fundamental primitive in unsupervised learning which gives rise to a rich class of computationally-challenging inference tasks. In this work, we focus on the canonical task of clustering d-dimensional Gaussian mixtures with unknown (and possibly degenerate) covariance. Recent works (Ghosh et al. '20; Mao, Wein '21; Davis, Diaz, Wang '21) have established lower bounds against the class of low-degree polynomial methods and the sum-of-squares (SoS) hierarchy for recovering certain hidden structures planted in Gaussian clustering instances. Prior work on many similar inference tasks portends that such lower bounds strongly suggest the presence of an inherent statistical-to-computational gap for clustering, that is, a parameter regime where the clustering task is statistically possible but no polynomial-time algorithm succeeds. One special case of the clustering task we consider is equivalent to the problem of finding a planted hypercube vector in an otherwise random subspace. We show that, perhaps surprisingly, this particular clustering model does not exhibit a statistical-to-computational gap, even though the aforementioned low-degree and SoS lower bounds continue to apply in this case. To achieve this, we give a polynomial-time algorithm based on the Lenstra--Lenstra--Lovasz lattice basis reduction method which achieves the statistically-optimal sample complexity of d+1 samples. This result extends the class of problems whose conjectured statistical-to-computational gaps can be "closed" by "brittle" polynomial-time algorithms, highlighting the crucial but subtle role of noise in the onset of statistical-to-computational gaps.
    Causal Mediation Analysis with Hidden Confounders. (arXiv:2102.11724v3 [stat.ML] UPDATED)
    (2 min) An important problem in causal inference is to break down the total effect of a treatment on an outcome into different causal pathways and to quantify the causal effect in each pathway. For instance, in causal fairness, the total effect of being a male employee (i.e., treatment) constitutes its direct effect on annual income (i.e., outcome) and the indirect effect via the employee's occupation (i.e., mediator). Causal mediation analysis (CMA) is a formal statistical framework commonly used to reveal such underlying causal mechanisms. One major challenge of CMA in observational studies is handling confounders, variables that cause spurious causal relationships among treatment, mediator, and outcome. Conventional methods assume sequential ignorability that implies all confounders can be measured, which is often unverifiable in practice. This work aims to circumvent the stringent sequential ignorability assumptions and consider hidden confounders. Drawing upon proxy strategies and recent advances in deep learning, we propose to simultaneously uncover the latent variables that characterize hidden confounders and estimate the causal effects. Empirical evaluations using both synthetic and semi-synthetic datasets validate the effectiveness of the proposed method. We further show the potential of our approach for causal fairness analysis.
    GCWSNet: Generalized Consistent Weighted Sampling for Scalable and Accurate Training of Neural Networks. (arXiv:2201.02283v1 [stat.ML])
    (2 min) We develop the "generalized consistent weighted sampling" (GCWS) for hashing the "powered-GMM" (pGMM) kernel (with a tuning parameter $p$). It turns out that GCWS provides a numerically stable scheme for applying power transformation on the original data, regardless of the magnitude of $p$ and the data. The power transformation is often effective for boosting the performance, in many cases considerably so. We feed the hashed data to neural networks on a variety of public classification datasets and name our method "GCWSNet". Our extensive experiments show that GCWSNet often improves the classification accuracy. Furthermore, it is evident from the experiments that GCWSNet converges substantially faster. In fact, GCWS often reaches a reasonable accuracy with merely (less than) one epoch of the training process. This property is much desired because many applications, such as advertisement click-through rate (CTR) prediction models, or data streams (i.e., data seen only once), often train just one epoch. Another beneficial side effect is that the computations of the first layer of the neural networks become additions instead of multiplications because the input data become binary (and highly sparse). Empirical comparisons with (normalized) random Fourier features (NRFF) are provided. We also propose to reduce the model size of GCWSNet by count-sketch and develop the theory for analyzing the impact of using count-sketch on the accuracy of GCWS. Our analysis shows that an "8-bit" strategy should work well in that we can always apply an 8-bit count-sketch hashing on the output of GCWS hashing without hurting the accuracy much. There are many other ways to take advantage of GCWS when training deep neural networks. For example, one can apply GCWS on the outputs of the last layer to boost the accuracy of trained deep neural networks.
    A Unified Statistical Learning Model for Rankings and Scores with Application to Grant Panel Review. (arXiv:2201.02539v1 [stat.ME])
    (2 min) Rankings and scores are two common data types used by judges to express preferences and/or perceptions of quality in a collection of objects. Numerous models exist to study data of each type separately, but no unified statistical model captures both data types simultaneously without first performing data conversion. We propose the Mallows-Binomial model to close this gap, which combines a Mallows' $\phi$ ranking model with Binomial score models through shared parameters that quantify object quality, a consensus ranking, and the level of consensus between judges. We propose an efficient tree-search algorithm to calculate the exact MLE of model parameters, study statistical properties of the model both analytically and through simulation, and apply our model to real data from an instance of grant panel review that collected both scores and partial rankings. Furthermore, we demonstrate how model outputs can be used to rank objects with confidence. The proposed model is shown to sensibly combine information from both scores and rankings to quantify object quality and measure consensus with appropriate levels of statistical uncertainty.
    Linear classifier, least-squares cost function, and outliers. (arXiv:1808.09222v2 [physics.data-an] UPDATED)
    (2 min) A set of introductory notes on the subject of data classification using a linear classifier and least-squares cost function, and the negative effect of the presence of outliers on the decision boundary of the linear discriminant. We also show how a simple scaling could make the outlier less significant, thereby obtaining a much better decision boundary. We present some numerical results.
    PAC-Bayesian Matrix Completion with a Spectral Scaled Student Prior. (arXiv:2104.08191v2 [stat.ML] UPDATED)
    (2 min) We study the problem of matrix completion in this paper. A spectral scaled Student prior is exploited to favour the underlying low-rank structure of the data matrix. We provide a thorough theoretical investigation for our approach through PAC-Bayesian bounds. More precisely, our PAC-Bayesian approach enjoys a minimax-optimal oracle inequality which guarantees that our method works well under model misspecification and under general sampling distribution. Interestingly, we also provide efficient gradient-based sampling implementations for our approach by using Langevin Monte Carlo. More specifically, we show that our algorithms are significantly faster than the Gibbs sampler in this problem. To illustrate the attractive features of our inference strategy, some numerical simulations are conducted and an application to image inpainting is demonstrated.
    A framework for deep learning emulation of numerical models with a case study in satellite remote sensing. (arXiv:1910.13408v2 [cs.LG] UPDATED)
    (3 min) Numerical models based on physics represent the state-of-the-art in earth system modeling and comprise our best tools for generating insights and predictions. Despite rapid growth in computational power, the perceived need for higher model resolutions overwhelms the latest-generation computers, reducing the ability of modelers to generate simulations for understanding parameter sensitivities and characterizing variability and uncertainty. Thus, surrogate models are often developed to capture the essential attributes of the full-blown numerical models. Recent successes of machine learning methods, especially deep learning, across many disciplines offer the possibility that complex nonlinear connectionist representations may be able to capture the underlying complex structures and nonlinear processes in earth systems. A difficult test for deep learning-based emulation, which refers to function approximation of numerical models, is to understand whether they can be comparable to traditional forms of surrogate models in terms of computational efficiency while simultaneously reproducing model results in a credible manner. A deep learning emulation that passes this test may be expected to perform even better than simple models with respect to capturing complex processes and spatiotemporal dependencies. Here we examine, with a case study in satellite-based remote sensing, the hypothesis that deep learning approaches can credibly represent the simulations from a surrogate model with comparable computational efficiency. Our results are encouraging in that the deep learning emulation reproduces the results with acceptable accuracy and often even faster performance. We discuss the broader implications of our results in light of the pace of improvements in high-performance implementations of deep learning as well as the growing desire for higher-resolution simulations in the earth sciences.
    AugmentedPCA: A Python Package of Supervised and Adversarial Linear Factor Models. (arXiv:2201.02547v1 [stat.ML])
    (2 min) Deep autoencoders are often extended with a supervised or adversarial loss to learn latent representations with desirable properties, such as greater predictivity of labels and outcomes or fairness with respect to a sensitive variable. Despite the ubiquity of supervised and adversarial deep latent factor models, these methods should demonstrate improvement over simpler linear approaches to be preferred in practice. This necessitates a reproducible linear analog that still adheres to an augmenting supervised or adversarial objective. We address this methodological gap by presenting methods that augment the principal component analysis (PCA) objective with either a supervised or an adversarial objective and provide analytic and reproducible solutions. We implement these methods in an open-source Python package, AugmentedPCA, that can produce excellent real-world baselines. We demonstrate the utility of these factor models on an open-source, RNA-seq cancer gene expression dataset, showing that augmenting with a supervised objective results in improved downstream classification performance, produces principal components with greater class fidelity, and facilitates identification of genes aligned with the principal axes of data variance with implications to development of specific types of cancer.
    Path classification by stochastic linear recurrent neural networks. (arXiv:2108.03090v2 [stat.ML] UPDATED)
    (2 min) We investigate the functioning of a classifying biological neural network from the perspective of statistical learning theory, modelled, in a simplified setting, as a continuous-time stochastic recurrent neural network (RNN) with identity activation function. In the purely stochastic (robust) regime, we give a generalisation error bound that holds with high probability, thus showing that the empirical risk minimiser is the best-in-class hypothesis. We show that RNNs retain a partial signature of the paths they are fed as the unique information exploited for training and classification tasks. We argue that these RNNs are easy to train and robust and back these observations with numerical experiments on both synthetic and real data. We also exhibit a trade-off phenomenon between accuracy and robustness.
    Contrastive Code Representation Learning. (arXiv:2007.04973v4 [cs.LG] UPDATED)
    (2 min) Recent work learns contextual representations of source code by reconstructing tokens from their context. For downstream semantic understanding tasks like summarizing code in English, these representations should ideally capture program functionality. However, we show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics. We propose ContraCode: a contrastive pre-training task that learns code functionality, not form. ContraCode pre-trains a neural network to identify functionally similar variants of a program among many non-equivalent distractors. We scalably generate these variants using an automated source-to-source compiler as a form of data augmentation. Contrastive pre-training improves JavaScript summarization and TypeScript type inference accuracy by 2% to 13%. We also propose a new zero-shot JavaScript code clone detection dataset, showing that ContraCode is both more robust and semantically meaningful. On it, we outperform RoBERTa by 39% AUROC in an adversarial setting and up to 5% on natural code.
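    The pre-training objective is of the standard contrastive (InfoNCE-style) form: embeddings of two functionally equivalent variants of a program should score higher with each other than with the other programs in the batch. A generic sketch of that loss follows (see the ContraCode repository for the authors' exact implementation):

    ```python
    # Generic InfoNCE loss with in-batch distractors (illustrative sketch).
    import torch
    import torch.nn.functional as F

    def info_nce(z1, z2, temperature=0.07):
        """z1, z2: (batch, dim) embeddings of two equivalent variants per program."""
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        logits = z1 @ z2.t() / temperature  # off-diagonal entries act as distractors
        labels = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, labels)
    ```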
    Time Series Anomaly Detection for Cyber-Physical Systems via Neural System Identification and Bayesian Filtering. (arXiv:2106.07992v2 [cs.LG] UPDATED)
    (2 min) Recent advances in AIoT technologies have led to an increasing popularity of utilizing machine learning algorithms to detect operational failures for cyber-physical systems (CPS). In its basic form, an anomaly detection module monitors the sensor measurements and actuator states from the physical plant, and detects anomalies in these measurements to identify abnormal operation status. Nevertheless, building effective anomaly detection models for CPS is rather challenging as the model has to accurately detect anomalies in presence of highly complicated system dynamics and unknown amount of sensor noise. In this work, we propose a novel time series anomaly detection method called Neural System Identification and Bayesian Filtering (NSIBF) in which a specially crafted neural network architecture is posed for system identification, i.e., capturing the dynamics of CPS in a dynamical state-space model; then a Bayesian filtering algorithm is naturally applied on top of the "identified" state-space model for robust anomaly detection by tracking the uncertainty of the hidden state of the system recursively over time. We provide qualitative as well as quantitative experiments with the proposed method on a synthetic and three real-world CPS datasets, showing that NSIBF compares favorably to the state-of-the-art methods with considerable improvements on anomaly detection in CPS.
    Acoustic Neighbor Embeddings. (arXiv:2007.10329v5 [eess.AS] UPDATED)
    (2 min) This paper proposes a novel acoustic word embedding called Acoustic Neighbor Embeddings where speech or text of arbitrary length are mapped to a vector space of fixed, reduced dimensions by adapting stochastic neighbor embedding (SNE) to sequential inputs. The Euclidean distance between coordinates in the embedding space reflects the phonetic confusability between their corresponding sequences. Two encoder neural networks are trained: an acoustic encoder that accepts speech signals in the form of frame-wise subword posterior probabilities obtained from an acoustic model and a text encoder that accepts text in the form of subword transcriptions. Compared to a triplet loss criterion, the proposed method is shown to have more effective gradients for neural network training. Experimentally, it also gives more accurate results with low-dimensional embeddings when the two encoder networks are used in tandem in a word (name) recognition task, and when the text encoder network is used standalone in an approximate phonetic matching task. In particular, in an isolated name recognition task depending solely on Euclidean nearest-neighbor search between the proposed embedding vectors, the recognition accuracy is identical to that of conventional finite state transducer (FST)-based decoding using test data with up to 1 million names in the vocabulary and 40 dimensions in the embeddings.
    Generalized quantum similarity learning. (arXiv:2201.02310v1 [quant-ph])
    (2 min) The similarity between objects is significant in a broad range of areas. While similarity can be measured using off-the-shelf distance functions, they may fail to capture the inherent meaning of similarity, which tends to depend on the underlying data and task. Moreover, conventional distance functions limit the space of similarity measures to be symmetric and do not directly allow comparing objects from different spaces. We propose using quantum networks (GQSim) for learning task-dependent (a)symmetric similarity between data that need not have the same dimensionality. We analyze the properties of such similarity functions analytically (for a simple case) and numerically (for a complex case) and show that these similarity measures can extract salient features of the data. We also demonstrate that the similarity measure derived using this technique is $(\epsilon,\gamma,\tau)$-good, resulting in theoretically guaranteed performance. Finally, we conclude by applying this technique to three relevant applications: classification, graph completion, and generative modeling.
    Bayesian Online Change Point Detection for Baseline Shifts. (arXiv:2201.02325v1 [stat.ML])
    (2 min) In time series data analysis, detecting change points on a real-time basis (online) is of great interest in many areas, such as finance, environmental monitoring, and medicine. One promising means to achieve this is the Bayesian online change point detection (BOCPD) algorithm, which has been successfully adopted in particular cases in which the time series of interest has a fixed baseline. However, we have found that the algorithm struggles when the baseline irreversibly shifts from its initial state. This is because with the original BOCPD algorithm, the sensitivity with which a change point can be detected is degraded if the data points are fluctuating at locations relatively far from the original baseline. In this paper, we not only extend the original BOCPD algorithm to be applicable to a time series whose baseline is constantly shifting toward unknown values but also visualize why the proposed extension works. To demonstrate the efficacy of the proposed algorithm compared to the original one, we examine these algorithms on two real-world data sets and six synthetic data sets.
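    For context, the fixed-baseline recursion that the paper extends is the Adams-MacKay run-length filter, which is short enough to sketch with a Gaussian observation model of known variance (parameter values are illustrative, and this is the original algorithm, not the proposed extension):

    ```python
    # Sketch of standard BOCPD (Adams & MacKay): maintain a posterior over the
    # current run length; a change point shows up as mass returning to length 0.
    import numpy as np
    from scipy import stats

    def bocpd(data, hazard=1 / 100, mu0=0.0, var0=1.0, var_x=1.0):
        T = len(data)
        R = np.zeros((T + 1, T + 1))
        R[0, 0] = 1.0  # run-length posterior
        mu, var = np.array([mu0]), np.array([var0])  # per-run-length posteriors
        for t, x in enumerate(data):
            pred = stats.norm.pdf(x, mu, np.sqrt(var + var_x))  # predictive prob.
            R[t + 1, 1:t + 2] = R[t, :t + 1] * pred * (1 - hazard)  # run grows
            R[t + 1, 0] = (R[t, :t + 1] * pred).sum() * hazard      # change point
            R[t + 1] /= R[t + 1].sum()
            # Conjugate Gaussian updates, with a fresh prior for run length 0
            new_var = 1.0 / (1.0 / var + 1.0 / var_x)
            new_mu = new_var * (mu / var + x / var_x)
            mu = np.concatenate([[mu0], new_mu])
            var = np.concatenate([[var0], new_var])
        return R
    ```
    The paper's point is that when the baseline mu0 itself drifts irreversibly, this fixed-prior recursion loses sensitivity, which is what motivates the proposed extension.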
    Similarities and Differences between Machine Learning and Traditional Advanced Statistical Modeling in Healthcare Analytics. (arXiv:2201.02469v1 [cs.LG])
    (2 min) Data scientists and statisticians are often at odds when determining the best approach, machine learning or statistical modeling, to solve an analytics challenge. However, machine learning and statistical modeling are more cousins than adversaries on different sides of an analysis battleground. Choosing between the two approaches or in some cases using both is based on the problem to be solved and outcomes required as well as the data available for use and circumstances of the analysis. Machine learning and statistical modeling are complementary, based on similar mathematical principles, but simply using different tools in an overall analytics knowledge base. Determining the predominant approach should be based on the problem to be solved as well as empirical evidence, such as size and completeness of the data, number of variables, assumptions or lack thereof, and expected outcomes such as predictions or causality. Good analysts and data scientists should be well versed in both techniques and their proper application, thereby using the right tool for the right project to achieve the desired results.
    Nonparametric learning for impulse control problems. (arXiv:1909.09528v3 [math.OC] UPDATED)
    (2 min) One of the fundamental assumptions in stochastic control of continuous-time processes is that the dynamics of the underlying (diffusion) process are known. This is, however, usually not fulfilled in practice. On the other hand, over the last decades, a rich theory for nonparametric estimation of the drift (and volatility) of continuous-time processes has been developed. The aim of this paper is to bring together techniques from stochastic control and methods from statistics for stochastic processes, in order to find a way to both learn the dynamics of the underlying process and control it in a reasonable way at the same time. More precisely, we study a long-term average impulse control problem, a stochastic version of the classical Faustmann timber harvesting problem. One of the problems that immediately arises is an exploration-exploitation dilemma, as is well known for problems in machine learning. We propose a way to deal with this issue by combining exploration and exploitation periods in a suitable way. Our main finding is that this construction can be based on the rates of convergence of estimators for the invariant density. Using this, we obtain that the average cumulated regret is of uniform order $O(T^{-1/3})$.
    Optimality in Noisy Importance Sampling. (arXiv:2201.02432v1 [stat.ML])
    (2 min) In this work, we analyze the noisy importance sampling (IS), i.e., IS working with noisy evaluations of the target density. We present the general framework and derive optimal proposal densities for noisy IS estimators. The optimal proposals incorporate the information of the variance of the noisy realizations, proposing points in regions where the noise power is higher. We also compare the use of the optimal proposals with previous optimality approaches considered in a noisy IS framework.
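    A toy self-normalized example of the setting (illustrative only; the paper's contribution concerns the choice of proposal, not this estimator). The target density is evaluated with unbiased multiplicative noise, and normalizing constants cancel in the self-normalized weights:

    ```python
    # Noisy importance sampling toy: estimate E[x] under an unnormalized N(2,1)
    # target using a N(0,9) proposal and noisy target evaluations.
    import numpy as np

    rng = np.random.default_rng(0)
    target = lambda x: np.exp(-0.5 * (x - 2.0) ** 2)   # unnormalized N(2, 1)
    x = rng.normal(0.0, 3.0, size=100_000)             # proposal samples
    q = np.exp(-0.5 * (x / 3.0) ** 2)                  # proposal density up to a constant
    noise = rng.lognormal(-0.02, 0.2, x.size)          # mean-1 multiplicative noise
    w = target(x) * noise / q
    print((w * x).sum() / w.sum())                     # approx. 2
    ```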
  • Google AI Blog

    Google Research: Themes from 2021 and Beyond
    (40 min) Posted by Jeff Dean, Senior Fellow and SVP of Google Research, on behalf of the entire Google Research community. Over the last several decades, I've witnessed a lot of change in the fields of machine learning (ML) and computer science. Early approaches, which often fell short, eventually gave rise to modern approaches that have been very successful. Following that long-arc pattern of progress, I think we'll see a number of exciting advances over the next several years, advances that will ultimately benefit the lives of billions of people with greater impact than ever before. In this post, I'll highlight five areas where ML is poised to have such impact. For each, I'll discuss related research (mostly from 2021) and the directions and progress we'll likely see in the next few years. …
  • The Official NVIDIA Blog

    NVIDIA Named America’s Best Place to Work on Latest Glassdoor List
    (2 min) NVIDIA is America’s best place to work, according to Glassdoor’s just-issued list of best employers for 2022. Amid a global pandemic that has affected every workplace, NVIDIA was ranked No. 1 on Glassdoor’s 14th annual Best Places to Work list for large US companies. The award is based on anonymous employee feedback covering thousands of …
  • MIT News - Artificial intelligence

    Q&A: Dolapo Adedokun on computer technology, Ireland, and all that jazz
    (8 min) MIT EECS student and Mitchell Scholar hopes to play music in Dublin while working on his MS in intelligent systems.
    The promise and pitfalls of artificial intelligence explored at TEDxMIT event
    (7 min) MIT scientists discuss the future of AI with applications across many sectors, as a tool that can be both beneficial and harmful.
  • AWS Machine Learning Blog

    Industrial automation at Tyson with computer vision, AWS Panorama, and Amazon SageMaker
    (9 min) This is the first in a two-part blog series on how Tyson Foods, Inc., is utilizing Amazon SageMaker and AWS Panorama to automate industrial processes at their meat packing plants by bringing the benefits of artificial intelligence applications at the edge. In part one, we discuss an inventory counting application for packaging lines. In part […]
  • Becoming Human: Artificial Intelligence Magazine - Medium

    7 Key Questions You Need To Answer Before Implementing Ethical AI
    (4 min) According to ethical AI statistics, 90% of organizations who have adopted ethical artificial intelligence said they know about at least one…
    Google Flu Trends Explained
    (4 min) Introduction
    2022 SEO: A Robust Future!
    (5 min) SEO = Ever-Changing with a Fast-Paced Agenda!
  • John D. Cook

    Random sampling to save money
    (1 min) I was stunned when my client said that a database query that I asked them to run would cost the company $100,000 per year. I had framed my question in the most natural way, not thinking that at the company’s scale it would be worth spending some time thinking about the query. Things have somewhat […]
    Greco-Latin squares and magic squares
    (2 min) Suppose you create an n × n Latin square filled with the first n letters of the Roman alphabet. This means that each letter appears exactly once in each row and in each column. You could repeat the same exercise only using the Greek alphabet. Is it possible to find two n × n Latin […]
    Latin squares and 3D chess
    (2 min) In a n×n Latin square, each of the numbers 1 through n appears exactly once in each row and column. For example, the 5 × 5 square below is a Latin square. If we placed a rook on each square numbered 1, the rooks would not attack each other since no two rooks would be […]
    New R book
    (1 min) Five years ago I recommended the book Learning Base R. Here’s the last paragraph of my review: Now there are more books on R, and some are more approachable to non-statisticians. The most accessible one I’ve seen so far is Learning Base R by Lawrence Leemis. It gets into statistical applications of R—that is ultimately why […]
  • Artificial Intelligence

    Chunkmogrify - Facial Editing AI with Masking
    (0 min) submitted by /u/cloud_weather
    CMU Researchers Propose A Computer Vision-Based Approach With Data-Frugal Deep Learning To Optimize Microstructure Imaging
    (1 min) Materials processing is the process of turning raw materials into final items through a sequence of phases or “unit operations.” The activities entail a series of industrial processes, including various mechanical and chemical methods, which are often carried out in big numbers or batches. Material processing requires extensive analysis and classification of complicated microstructures for quality control. For example, the proportion of lath-type bainite in various high-strength steels affects the material’s characteristics. However, recognizing bainite in microstructural images takes time and money because researchers must first employ two types of microscopy to get a closer look, then rely on their own skills to identify bainitic regions. submitted by /u/ai-lover
    Awesome R&D content (with code!) on Computer Vision News of January 2022.
    (1 min) Many great articles about AI, Deep Learning, Computer Vision and more... HTML5 version (recommended), PDF version. Dilbert on page 2. Free subscription on page 70. Enjoy! submitted by /u/Gletta
    Superintelligent Utility Monster thought experiment
    (1 min) submitted by /u/HumanSeeing
    Any text based games like ai dungeon
    (1 min) submitted by /u/roblox22y
    Is there a similar AI which is able to make sketches out of images like the one below?
    (0 min) https://youtu.be/Peccbcj8Ibs submitted by /u/xXLisa28Xx
    Is Low-Code the Future of Programming?
    (0 min) submitted by /u/Beautiful-Credit-868
    Q-learning and Sarsa in grid environment for short-term vs long-term rewards
    (1 min) I created my custom grid (7 by 7) environment to apply RL algorithms, choosing Q-learning and Sarsa in particular. The grid environment consists of 3 types of terminating states: states with negative reward (-100), a state with maximum reward (100), and 2 states with half reward (50). The main goal of training is for the agent to avoid states with negative rewards and to prefer the long-term reward (100) over the short-term half reward (50). The trained agent works weirdly when the half-rewarded state is closer to the main reward; however, if the half-rewarded states are not that close to the main reward, then both algorithms efficiently train the agent to go only to the main reward. So, from what I understand, the result depends on the position of the half-rewarded state. My hyperparameters for both Q-learning and Sarsa are the following: epsilon=1 (gradually decayed with a linear function), gamma=0.99 (I read that for the agent to learn the main reward, the gamma should be high, approximately 0.9-0.99), alpha=0.1. Can the problem be my environment? I'm confused because both algorithms work well without the half-rewarded state. So the problem is that depending on where the half-rewarded states are located, the algorithms sometimes don't train the agent to choose the long-term reward. If someone had a similar problem, I would really appreciate it if you could share how you solved it. submitted by /u/studentani
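    The behavioural difference between the two algorithms comes down to a single term in the update: Q-learning bootstraps off the greedy next action, while SARSA bootstraps off the action the epsilon-greedy policy actually takes. A minimal sketch with a dict-of-dicts Q-table (names are illustrative):

    ```python
    # Q-learning uses the max over next actions (off-policy target) ...
    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])

    # ... while SARSA uses the action actually taken next (on-policy target).
    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
        Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
    ```
    One quick sanity check on the environment itself: with gamma=0.99, a terminal reward of 100 reached n steps later than a terminal reward of 50 reached after m steps is only preferred while 100 * 0.99**n > 50 * 0.99**m, so if the half-reward state sits much closer to the start, preferring it can be the genuinely optimal behaviour rather than a training failure.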
    Predictions about the future and A.I.
    (5 min) As time passes, Artificial Intelligence (A.I.) technology is becoming a bigger thing than ever today. Even though A.I. does come with its benefits, like automated machines, etc., I do have a few concerns with A.I. and the negative things about it. A few things I question are, firstly, self-driving cars. Now I know that many people are taking advantage of this relatively new feature; however, we should consider the fact that this is a machine doing the driving. In most cases this system is reliable, but we should still account for the small number of fatalities and issues that came from it malfunctioning. In addition to that, another thing I would like to discuss is surgeries. I believe that if AI gets to the point where they are providing health care, they should do simple tasks like gr…
  • Machine Learning

    Time Series with high seasonal period [Discussion]
    (0 min) What might be the best model(s) for daily time series with annual seasonality (period = 365)? I used auto_arima and it doesn’t support high seasonal periods. It takes forever to fit and then throws a memory error. I read a little about Fourier but not sure how to use that in Python. submitted by /u/CheeseBurgersx
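    One common workaround is to keep the ARIMA part non-seasonal and encode the annual cycle with Fourier terms passed as exogenous regressors; a sketch (assuming y is a 1-d numpy array of daily observations, and the order is a placeholder):

    ```python
    # Fourier terms as exogenous regressors for long-period seasonality (sketch).
    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    def fourier_terms(t, period=365.25, K=3):
        """t: integer time index -> (len(t), 2K) matrix of sin/cos features."""
        return np.column_stack([f(2 * np.pi * (k + 1) * t / period)
                                for k in range(K) for f in (np.sin, np.cos)])

    t = np.arange(len(y))
    model = SARIMAX(y, exog=fourier_terms(t), order=(1, 1, 1)).fit(disp=False)
    future_t = np.arange(len(y), len(y) + 30)
    forecast = model.forecast(steps=30, exog=fourier_terms(future_t))
    ```
    K controls how wiggly the seasonal shape can be; increasing K adds harmonics at the cost of more regressors.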
    [D] Interview - This Team won the Minecraft RL BASALT Challenge! (Paper Explanation & Interview with the authors)
    (1 min) https://youtu.be/a4P8v8lGFPw The MineRL BASALT challenge has no reward functions or technical descriptions of what's to be achieved. Instead, the goal of each task is given as a short natural language string, and the agent is evaluated by a team of human judges who rate both how well the goal has been fulfilled, as well as how human-like the agent behaved. In this video, I interview KAIROS, the winning team of the 2021 challenge, and discuss how they used a combination of machine learning, efficient data collection, hand engineering, and a bit of knowledge about Minecraft to beat all other teams. OUTLINE: 0:00 - Introduction 4:10 - Paper Overview 11:15 - Start of Interview 17:05 - First Approach 20:30 - State Machine 26:45 - Efficient Label Collection 30:00 - Navigation Policy 38:15 - Odometry Estimation 46:00 - Pain Points & Learnings 50:40 - Live Run Commentary 58:50 - What other tasks can be solved? 1:01:55 - What made the difference? 1:07:30 - Recommendations & Conclusion 1:11:10 - Full Runs: Waterfall 1:12:40 - Full Runs: Build House 1:17:45 - Full Runs: Animal Pen 1:20:50 - Full Runs: Find Cave Paper: https://arxiv.org/abs/2112.03482 Code: https://github.com/viniciusguigo/kairos_minerl_basalt Challenge Website: https://minerl.io/basalt/ Paper Title: Combining Learning from Human Feedback and Knowledge Engineering to Solve Hierarchical Tasks in Minecraft submitted by /u/ykilcher
    [D] Build system for machine learning?
    (1 min) Hey all, I'm setting up a new machine learning project and I know I'm going to be collecting data over time (and making model improvements). Is there a best way to set up a build system to generate models using my training script, eval on test data, all relatively automatically? Curious what other people have tried and liked (or disliked). Thanks! submitted by /u/smp2005throwaway
    [D] Significance of MLM loss when pre-training Transformers for language modeling
    (1 min) What significance does the MLM loss have when I'm pre-training Transformers for language modeling from scratch or continuing the training of a pre-trained model on a different dataset? Apart from initial spikes, I don't really see any significant movement in the loss curves. Most papers just evaluate on downstream tasks such as NER or NLI. Is the MLM loss really not that well interpretable? submitted by /u/optimized-adam
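    One handle on interpretability: the MLM loss is cross-entropy over the masked positions only, so exp(loss) is the masked-token perplexity, which is often easier to compare across runs than the raw loss. A short sketch of computing it with Hugging Face Transformers (model choice illustrative):

    ```python
    # MLM loss on a small batch: the collator masks ~15% of tokens and sets
    # labels to -100 everywhere else, so the loss covers masked positions only.
    from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                              DataCollatorForLanguageModeling)

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
    collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm_probability=0.15)

    texts = ["Masked language modeling reconstructs held-out tokens."]
    batch = collator([tok(t) for t in texts])
    loss = model(**batch).loss  # cross-entropy over the masked tokens
    print(float(loss), float(loss.exp()))  # loss and masked-token perplexity
    ```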
    [D] How is confidence bound derived for LinUCB
    (2 min) Hi! I am reading the paper A Contextual-Bandit Approach to Personalized News Article Recommendation. I struggle to understand how the confidence bound is derived. In particular, I do not understand the term under the square root: why is this a standard deviation? Also, what values does alpha take or correspond to? I did read the paper, but could not get the interpretation for the derivation of the upper confidence bound. Would appreciate any hints/advice. Thanks! The bound in question is $|x_{t,a}^T\hat{\theta}_a - E[r_{t,a}|x_{t,a}]| \leq \alpha \sqrt{x_{t,a}^T(D_a^T D_a + I_d)^{-1}x_{t,a}}$ for any $\delta > 0$ and $x_{t,a} \in \mathbb{R}^d$, where $\alpha = 1 + \sqrt{\ln(2/\delta)/2}$, and the arm-selection rule is $a_t = \arg\max_{a \in A_t}\big(x_{t,a}^T\hat{\theta}_a + \alpha \sqrt{x_{t,a}^T(D_a^T D_a + I_d)^{-1}x_{t,a}}\big)$. submitted by /u/denis56
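    Intuitively, the square-root term acts as a standard deviation because, with $A_a = D_a^T D_a + I_d$, the ridge estimate $\hat{\theta}_a = A_a^{-1} D_a^T c_a$ satisfies $\mathrm{Var}(x_{t,a}^T\hat{\theta}_a) \le \sigma^2\, x_{t,a}^T A_a^{-1} x_{t,a}$ (the ridge term only shrinks the variance), so $\sqrt{x_{t,a}^T A_a^{-1} x_{t,a}}$ is a standard-deviation-sized confidence width; $\alpha$ sets how many such widths of optimism to add, and the quoted $\alpha = 1 + \sqrt{\ln(2/\delta)/2}$ makes the bound hold with probability at least $1-\delta$. A numerical sketch of the per-arm score (variable names here are illustrative, not the paper's):

    ```python
    # LinUCB score for one arm (sketch): ridge point estimate plus an
    # exploration width proportional to the estimate's standard deviation.
    import numpy as np

    def linucb_score(x, D, c, alpha):
        """x: (d,) context; D: (n, d) past contexts for the arm; c: (n,) rewards."""
        A = D.T @ D + np.eye(x.size)                 # ridge design matrix
        theta_hat = np.linalg.solve(A, D.T @ c)
        width = np.sqrt(x @ np.linalg.solve(A, x))   # sqrt(x^T A^{-1} x)
        return x @ theta_hat + alpha * width
    ```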
    [P] txtai 4.0 released - semantic search with SQL, content storage, object storage, reindexing and more
    (1 min) txtai 4.0 has been released with a number of new features. Content storage: content can now be stored alongside embeddings vectors, so external data storage is no longer required. Query with SQL: txtai supports both natural language queries, e.g. embeddings.search("feel good story"), and SQL queries, e.g. SELECT id, text, score FROM txtai WHERE similar('feel good story') AND score >= 0.15. Object storage: binary objects can be stored alongside embeddings vectors. Reindex: indexes can be rebuilt using stored content, with no need to resend data to txtai. Index compression: indexes can be compressed using GZ/XZ/BZ2/ZIP. External vectors: use external vector models from an API or an external library, centralizing vector building on GPU servers and leaving index servers to be powered by more modest hardware. More information can be found in the following links: GitHub Project, 4.0 Release Notes, What's new in txtai 4.0, Documentation, Examples. submitted by /u/davidmezzetti
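    A minimal end-to-end sketch consistent with the release notes above (model path and sample data are placeholders, not from the post):

    ```python
    # txtai 4.x: content storage plus natural-language and SQL queries (sketch).
    from txtai.embeddings import Embeddings

    data = ["US tops 5 million confirmed virus cases",
            "Maine man wins $1M from $25 lottery ticket"]

    # content=True stores the original text alongside the vectors (new in 4.0)
    embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2",
                             "content": True})
    embeddings.index((uid, text, None) for uid, text in enumerate(data))

    print(embeddings.search("feel good story", 1))
    print(embeddings.search("SELECT id, text, score FROM txtai "
                            "WHERE similar('feel good story') AND score >= 0.15"))
    ```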
    [Project] A dataset of parse trees generated from abstracts of arXiv articles
    (1 min) (Admittedly cross-posting from r/LanguageTechnology) https://github.com/l74d/scholarly-trees I have put up a fairly large set of parse trees online as a dataset. It is not as substantial as the Penn Treebank, since the trees have not been human-edited, but it is still far more parse trees than Penn provides, free of charge or hassle, to feed into your later-stage NLP algorithms. The current format is straight from where they were generated. Suggestions of alternative formats based on ease of use would be heavily appreciated! submitted by /u/l74d [link] [comments]
    [R] "Hot under the collar: A latent measure of interstate hostility"
    (1 min) Tl; dr: A Bayesian ideal-point model for modeling crises in international relations. Abstract: "The majority of studies on international conflict escalation use a variety of measures of hostility including the use of force, reciprocity, and the number of fatalities. The use of different measures, however, leads to different empirical results and creates difficulties when testing existing theories of interstate conflict. Furthermore, hostility measures currently used in the conflict literature are ill suited to the task of identifying consistent predictors of international conflict escalation. This article presents a new dyadic latent measure of interstate hostility, created using a Bayesian item-response theory model and conflict data from the Militarized Interstate Dispute (MID) and Phoenix political event datasets. This model (1) provides a more granular, conceptually precise, and validated measure of hostility, which incorporates the uncertainty inherent in the latent variable; and (2) solves the problem of temporal variation in event data using a varying-intercept structure and human-coded data as a benchmark against which biases in machine-coded data are corrected. In addition, this measurement model allows for the systematic evaluation of how existing measures relate to the construct of hostility. The presented model will therefore enhance the ability of researchers to understand factors affecting conflict dynamics, including escalation and de-escalation processes." Paper: https://journals.sagepub.com/doi/pdf/10.1177/0022343320962546 Non-paywalled: http://zterechshenko.com/assets/ZTerechshenko_MA.pdf submitted by /u/bikeskata [link] [comments]
    [D] large query set few shot learning
    (1 min) Is there a research field specific to image classification / embedding learning where we have a large number of classes (2k+) and only a few samples per class (~1-5)? What I noticed in few-shot learning papers is that the query set is always very small, e.g. 20-way 1-shot or something similar. Is there any paper you have seen that I can check? Thank you! submitted by /u/flow_smith [link] [comments]
    [P] Tiny Video Search Engine Using OpenAI's CLIP
    (1 min) A fun project that I did to try out OpenAI's CLIP model. In this article, I describe a tiny video search engine and indexer that will let you search through a video with descriptive "natural language" queries and find matching frames of video. All the code is included in a Google Colab Notebook. So even if you don't have your own CUDA-capable GPU, you can easily run the code yourself without setting up anything on your own computer. submitted by /u/CakeStandard3577 [link] [comments]
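    The core of such a search engine is compact: embed each video frame and the text query into CLIP's shared space and rank frames by cosine similarity. A hedged sketch using OpenAI's clip package (not the article's code; frame extraction from the video is omitted):
        import torch
        import clip
        from PIL import Image

        device = "cuda" if torch.cuda.is_available() else "cpu"
        model, preprocess = clip.load("ViT-B/32", device=device)

        def score_frame(frame_path, query):
            # Embed one frame and one natural-language query,
            # then return their cosine similarity
            image = preprocess(Image.open(frame_path)).unsqueeze(0).to(device)
            text = clip.tokenize([query]).to(device)
            with torch.no_grad():
                img_emb = model.encode_image(image)
                txt_emb = model.encode_text(text)
            img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
            txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
            return (img_emb @ txt_emb.T).item()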
    [P] TensorFlow Similarity now self-supervised training
    (1 min) Happy new year :) Very happy to announce that, as part of the 0.15 release, TensorFlow Similarity now supports self-supervised learning using state-of-the-art algorithms. To help you get started, the release includes a detailed getting-started notebook that you can run in Colab. The notebook shows you how to use SimSiam self-supervised pre-training to almost double the accuracy compared to a model trained from scratch on CIFAR-10. Hope you will find this release useful for your research and experimentation :) submitted by /u/ebursztein [link] [comments]
    [R] Twitter Crawl Data for Research
    (1 min) Hello all, I am looking to crawl Twitter data for academic research (I will most likely need to release/open-source the dataset). Do you guys know the license situation? (I have already read their webpage and terms and conditions.) However, I don't find many open-source Twitter datasets, and I'm wondering if there are any hidden terms that I am not aware of. submitted by /u/hushedBobolink7 [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Chunkmogrify - Facial Editing AI with Masking
    (0 min) submitted by /u/cloud_weather [link] [comments]
    From 53% to 95% acc - Real vs Fake Faces Classification | Fine-tuning EfficientNet (Github in comment)
    (1 min) submitted by /u/oFlamingo [link] [comments]
    How do I make a Neural Net AI with Cheat Engine?
    (2 min) I saw somebody explain how they made an AI for Street Fighter 5 via Cheat Engine, but I am unsure how exactly they managed this. I'm making one for a different game; can someone please help me? submitted by /u/GokuKing922 [link] [comments]
  • Reinforcement Learning

    Immediate reward and Final reward
    (1 min) Hello guys, I hope you are doing well. I am struggling with designing a deep reinforcement learning model: I want to make a model that works with immediate rewards and a final reward at the end of each episode. I am wondering if the final reward must have the same distribution as the immediate rewards, because when I designed an immediate reward in the range [-3,-2] and a final reward in the range [-1,0], the agent learned to minimize the reward, as in the figure below. My second question is how the agent distinguishes the final reward from the immediate ones. Thank you! https://preview.redd.it/pggvt2rzk4b81.png?width=467&format=png&auto=webp&s=d90b250c2242bc9d8936c8958f3a8258fd92f70a submitted by /u/GuavaAgreeable208 [link] [comments]
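    One way to see the scale issue raised here: both kinds of reward enter the same discounted return, so the agent never distinguishes them by type; the terminal reward is just the last term of the sum, and if it is small relative to the accumulated per-step costs it can be drowned out. A toy illustration:
        def discounted_return(step_rewards, final_reward, gamma=0.99):
            # The terminal reward is simply the last entry of the reward sequence
            rewards = list(step_rewards) + [final_reward]
            return sum(r * gamma**t for t, r in enumerate(rewards))

        # Per-step rewards in [-3, -2] and a terminal reward in [-1, 0]:
        # the terminal term is a small fraction of the total return
        print(discounted_return([-2.5] * 10, -0.5))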
    How to improve PPO on BipedalWalker-v3?
    (1 min) I implemented basic PPO on OpenAI's BipedalWalker-v3 and, after tuning hyperparameters, I got a maximum mean reward of 230 over the last 100 episodes. Do you know any improvements to my basic PPO implementation that I can use to get a better result, closer to the reward of 300? submitted by /u/TheGuy839 [link] [comments]
    Is there any method to obtain the true optimal Q-function for a low dimensional continuous problem (for example, cartpole)?
    (1 min) submitted by /u/Rich_Beautiful [link] [comments]
    First try at Blogging! An Introduction to Concepts in Reinforcement Learning
    (1 min) Tried my hand at blogging after procrastinating a lot. Hope y'all find it interesting: Introduction to Concepts in Reinforcement Learning. For any corrections, criticisms or clarifications, feel free to hmu :) submitted by /u/needANewPlague_ [link] [comments]
    What advantages do actor-critic methods / deep RL give over a naive policy gradient algorithm like REINFORCE?
    (1 min) submitted by /u/aabra__ka__daabra [link] [comments]
    help understanding ppo algorithm
    (2 min) I'm having a hard time understanding the PPO algorithm (intuitively). Here are the questions I have:
    1. What is the gist of calculating the advantage? I know that it's calculating the "betterness" of the state/action compared to what the network was actually expecting, but then why is the advantage formula reward + value_of_next_state - value_of_current_state? How does this tell us how advantageous the current state/action is?
    2. If the value network is supposed to output the expected/discounted returns (I hope I am not wrong here, or is it supposed to return the reward for just the current state?), then how does it stack up against continuous observation spaces? E.g., in a car game where the input is pixels, I can sample a random state from the whole game; how is the network supposed to estimate the discounted returns? It doesn't know how far the map goes or, for that matter, how far it is from the start of the track.
    3. How does the gradient affect the action? Meaning, in a multi-discrete action space, how does the gradient affect only the action that I took, instead of backpropagating the gradient to all actions?
    And finally, this is how I'm currently calculating returns and advantages (for a continuous action space and pixel-based inputs to the CNN):
        self.values[self.ptr] = next_value
        deltas = self.rewards + self.gamma * self.values[1:] - self.values[:-1]
        self.advantage = self.discounted_sum(deltas, self.gamma * self.lam)
        self.returns = self.discounted_sum(self.rewards, self.gamma)
    Am I doing it right? submitted by /u/jedi1026 [link] [comments]
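    For comparison, a common reference implementation of Generalized Advantage Estimation (GAE), which uses the same deltas as above but computes the discounted sum backwards and takes the value-network targets to be advantages + values (episode-boundary masking is omitted for brevity):
        import numpy as np

        def gae(rewards, values, gamma=0.99, lam=0.95):
            # values has one extra entry: V(s_T) bootstraps the final step, which is
            # also how sampling mid-episode works -- the critic's bootstrap stands in
            # for the unknown remainder of the trajectory
            deltas = rewards + gamma * values[1:] - values[:-1]  # TD errors
            adv = np.zeros_like(deltas)
            running = 0.0
            for t in reversed(range(len(deltas))):
                running = deltas[t] + gamma * lam * running
                adv[t] = running
            returns = adv + values[:-1]  # regression targets for the value network
            return adv, returns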
    How do I use a Baselines algorithm such as A2C or PPO, but with a custom reward function? (OpenAI Retro)
    (1 min) Hi. I used neat-python to make an AI for Pokemon Red, but it doesn't get very far. The reward function I made gives it 10 reward every time the RAM values change, as checked every 10 frames (I made a list of which RAM values it should watch). I did this because I wanted to try a "curiosity" reward. Since the NEAT AI isn't getting very far, I decided to try a different, non-genetic algorithm, hoping that it will perform better. I have my eye on A2C and PPO, but I cannot find a way to give them a custom reward function. It seems that they use the environment's reward function, which appears to be editable only in Lua. Can someone give me pointers on how to implement a custom reward function for reinforcement learning that is not NEAT? I just need it to take in a list of inputs, output a list, and learn from those and the rewards it gets. I've tried to code the reward function in Lua but I was having issues, so I'd prefer it to be in Python. submitted by /u/Unsightedmetal6 [link] [comments]
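    One route that works with any Stable Baselines algorithm is to wrap the environment so the learning code never sees the built-in (Lua-defined) reward. A hedged sketch (RamCuriosityReward is a hypothetical name; it assumes the gym-retro environment exposes get_ram(), and the watched-address logic is purely illustrative):
        import gym

        class RamCuriosityReward(gym.Wrapper):
            """Replace the environment's reward with a bonus whenever
            watched RAM addresses change."""
            def __init__(self, env, watched_addresses):
                super().__init__(env)
                self.watched = watched_addresses
                self.last_values = None

            def step(self, action):
                obs, _, done, info = self.env.step(action)  # discard built-in reward
                ram = self.env.get_ram()                    # assumed gym-retro method
                values = [int(ram[a]) for a in self.watched]
                reward = 10.0 if values != self.last_values else 0.0
                self.last_values = values
                return obs, reward, done, info
    A2C or PPO can then be trained on the wrapped environment as usual, since they only ever see the rewards returned by step().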
    What's the difference between feature vectors and state space in reinforcement learning?
    (1 min) I encounter both terms frequently. They seem to refer to the same thing. I am so confused. Anybody care to help clarify? Thanks. submitted by /u/Asleep_Donut1382 [link] [comments]

2022-01-10

  • The Official NVIDIA Blog

    Leading HPC Software Company Bright Computing Joins NVIDIA
    (2 min) Bright Computing, a leader in software for managing high performance computing systems used by more than 700 organizations worldwide, is now part of NVIDIA. Companies in healthcare, financial services, manufacturing and other markets use its tools to set up and run HPC clusters, groups of servers linked by high-speed networks into a single unit. Read article > The post Leading HPC Software Company Bright Computing Joins NVIDIA appeared first on The Official NVIDIA Blog.
    AI Startup Speeds Up Derivative Models for Bank of Montreal
    (3 min) To make the best portfolio decisions, banks need to accurately calculate the values of their trades, while factoring in uncertain external risks. This requires high-performance computing power to run complex derivatives models — which find fair prices for financial contracts — as close to real time as possible. “You don’t want to trade today on yesterday’s…” Read article > The post AI Startup Speeds Up Derivative Models for Bank of Montreal appeared first on The Official NVIDIA Blog.
    Cloud Control: Production Studio Taylor James Elevates Remote Workflows With NVIDIA Technology
    (3 min) WFH was likely one of the most-used acronyms of the past year, with more businesses than ever looking to enhance their employees’ remote experiences. Creative production studio Taylor James found a cloud-based solution to maintain efficiency and productivity — even while working remotely — with NVIDIA RTX Virtual Workstations on AWS. With locations in New York… Read article > The post Cloud Control: Production Studio Taylor James Elevates Remote Workflows With NVIDIA Technology appeared first on The Official NVIDIA Blog.
  • MIT News - Artificial intelligence

    Physics and the machine-learning “black box”
    (6 min) In 2.C161, George Barbastathis demonstrates how mechanical engineers can use their knowledge of physical systems to keep algorithms in check and develop more accurate predictions.
  • AWS Machine Learning Blog

    Develop an automatic review image inspection service with Amazon SageMaker
    (8 min) This is a guest post by Jihye Park, a Data Scientist at MUSINSA.  MUSINSA is one of the largest online fashion platforms in South Korea, serving 8.4M customers and selling 6,000 fashion brands. Our monthly user traffic reaches 4M, and over 90% of our demographics consist of teens and young adults who are sensitive to […]
    How ReliaQuest uses Amazon SageMaker to accelerate its AI innovation by 35x
    (5 min) Cybersecurity continues to be a top concern for enterprises. Yet the constantly evolving threat landscape that they face makes it harder than ever to be confident in their cybersecurity protections. To address this, ReliaQuest built GreyMatter, an Open XDR-as-a-Service platform that brings together telemetry from any security and business solution, whether on-premises or in one or multiple clouds, to unify detection, investigation, response, and resilience. In 2021, ReliaQuest turned to AWS to help it enhance its artificial intelligence (AI) capabilities and build new features faster.
  • Becoming Human: Artificial Intelligence Magazine - Medium

    ML-Based Energy Consumption Forecasting
    (4 min) Machine Learning has diverse applications in business, but it is probably the most effective for forecasting the future demand of a…
    Gradient-Based Optimization Demistified
    (4 min) In real life, you often have to deal with things you don’t completely understand. For instance, you drive a car, not knowing how the engine…
    How to achieve data interoperability in healthcare: tips from ITRex
    (11 min) Health data is notoriously difficult to share. Due to its sensitive nature, it requires more privacy and security than any other data…
  • John D. Cook

    Computational asceticism
    (3 min) A while back I wrote about computational survivalism, being prepared to work productively in a restricted environment. The first time I ran into computational survivalism was when someone said to me “I prefer Emacs, but I use vi because I only want to use tools I can count on being installed everywhere.” I thought that […] Computational asceticism first appeared on John D. Cook.
  • Artificial Intelligence

    Weekly China AI News: Alibaba Loses Head of Self-Driving Unit; AI Creates Images from Text, and Vice Versa; Geely, Mobileye to Build Self-Driving EV for Consumers
    (1 min) submitted by /u/trcytony [link] [comments]
    Learn Machine Learning from the ground up!
    (0 min) Hi! This is my attempt to demystify the minutiae of Machine Learning and Deep Learning. The goal is for the information to be complete and intuitive. https://youtube.com/channel/UC6sjv3MMPEoFCioD6-Ham4w I'd love to hear feedback. Please feel free to DM. Also, please share and subscribe if you like the content. #MachineLearning #DeepLearning #MachineLearningwithHarsh #YouTube submitted by /u/mr-minion [link] [comments]
    UC Sandiego Researchers Propose A Controllable Voice Cloning Method That Allows Fine-Grained Control Over Various Style Aspects Of The Synthesized Speech For An Unseen Speaker
    (1 min) Current voice cloning methods achieve Text-to-Speech (TTS) synthesis for a new voice, but they do not allow manipulating the expressiveness of the synthesized speech. Voice cloning is the task of learning to synthesize the speech of an unseen speaker with the least amount of training data. UC San Diego researchers propose a controllable voice cloning method that offers fine-grained control over many style aspects of synthetic speech for an unseen speaker. The voice synthesis model is explicitly conditioned on a speaker encoding, pitch contour, and latent style tokens during training. Continue Reading Paper: https://arxiv.org/pdf/2102.00151.pdf submitted by /u/ai-lover [link] [comments]
    Last Week in AI - cuddly robo-dogs, self-farming farms, AI-crafted craft beer recipes, and more!
    (0 min) submitted by /u/regalalgorithm [link] [comments]
    [R] Counterfactual Memorization in Language Models: Distinguishing Rare from Common Memorization
    (1 min) A team from Google Research, University of Pennsylvania and Cornell University proposes a principled perspective to filter out common memorization for LMs, introducing "counterfactual memorization" to measure the expected change in a model’s prediction and distinguish “rare” (episodic) memorization from “common” (semantic) memorization in neural LMs. Here is a quick read: Counterfactual Memorization in Language Models: Distinguishing Rare from Common Memorization. The paper Counterfactual Memorization in Neural Language Models is on arXiv. submitted by /u/Yuqing7 [link] [comments]
    REPEATING ITSELF
    (1 min) So, do you ever notice that when something is said, it goes popular and circles around social media? A meme or a popular thing on social media is mostly fake... as if it's AI repeating itself. Just like our thoughts are repeated by the movies, shows, or news we read online, most of what we say has already been said... it's already been done for real, yo. submitted by /u/toppsick [link] [comments]
    Could AI labor turn the tables and spark radical changes in working hierarchies?
    (1 min) In a working environment, errors are quite often attributed to an individual, because blaming one person is considered a cheaper fix than rethinking an entire process. Middle-level employees - like project managers - won't have anyone to blame if bottom-level employees are all replaced by AI, which is not infallible but is fair. They'll find themselves working for AI and not the other way around. By breaking the blame cycle, this "fairness" responsibility might propagate higher than the bottom-to-middle level and induce a more radical systemic change. submitted by /u/MariadocBrandybuc88 [link] [comments]
    Top 10 Speech-to-Text APIs
    (0 min) submitted by /u/tah_zem [link] [comments]
    Do you think "intelligent" robots like Sophia will become relatively more common in the future? When?
    (2 min) Do you think humanity will see a future where people don't bat an eye at the thought of robots of this kind? submitted by /u/Fantasyneli [link] [comments]
    Artificial Intelligence and Education
    (1 min) Very interesting to see this technology become more popular and spread. Friends have been using tools like hyperwrite and speedwrite to get ideas for their writing and assist with drafts, and it seems like more and more people are talking about this as the technology becomes more accessible. It will be interesting to see how this technology will impact the future of education, and curious if any students (or others) have tried writing tools like this? Just started getting really exposed to this stuff and loving it! submitted by /u/aiguy2 [link] [comments]
    Researchers From China Propose A Pale-Shaped Self-Attention (PS-Attention) And A General Vision Transformer Backbone, Called Pale Transformer
    (1 min) Transformers have recently demonstrated promising performance in a variety of vision tasks. Inspired by Transformer's success on a wide range of NLP tasks, Vision Transformer (ViT) first employed a pure Transformer architecture for image classification, demonstrating the promise of the Transformer architecture for vision tasks. However, the quadratic complexity of global self-attention leads to high computing costs and memory use, particularly in high-resolution settings, rendering it unsuitable for diverse visual tasks. Various strategies confine the range of attention to a local region to increase efficiency and lower the quadratic computational complexity of global self-attention. As a result, their receptive fields in a single attention layer are not large enough, resulting in poor context modeling. A new Pale-Shaped self-Attention (PS-Attention) method executes self-attention inside a pale-shaped region to address this issue. Compared to global self-attention, PS-Attention can considerably lower compute and memory costs. Meanwhile, it can capture richer contextual information under the same computational complexity as earlier local self-attention techniques. Continue Reading Paper: https://arxiv.org/pdf/2112.14000v1.pdf Github: https://github.com/BR-IDL/PaddleViT submitted by /u/ai-lover [link] [comments]
  • Machine Learning

    Dynamic Time Series Chunk Regression Problem [D]
    (1 min) Hello all, I currently have an issue and I'd like to hear some advice on how to tackle it. I'm running independent trials, where for each trial a series of variables is recorded at each timestep, and each trial lasts a different period of time. The outcome of each trial is a single value. I'd like to predict this single output value given the variable-length time series. To give a visual example, here is what some data would look like: https://preview.redd.it/ssz9dlb74ya81.png?width=704&format=png&auto=webp&s=da8cd49520e5570e45bdc85ff5e7cf9f7ff75ef6 Each of the first 8 columns represents a variable, each row represents a timestep, and each blank row separates the timesteps into chunks, where each chunk is a trial; the last column is the single value resulting from each trial that I'd like to predict. Any advice on how to tackle this problem of predicting the last column given a dynamic number of timesteps for a set of time series variables? submitted by /u/MyActualUserName99 [link] [comments]
    [R] Model Selection in Batch Policy Optimization
    (0 min) submitted by /u/hardmaru [link] [comments]
    [D] Is there a solid (aka non-heuristic) reason why smaller batch sizes lead to better generalization?
    (2 min) I mean a mathematical argument. Thanks everyone submitted by /u/alesaso2000 [link] [comments]
    [D] Why is one loss positive and the other loss negative in a multi-output branch?
    (1 min) I was reading section 3.1.D of this paper on re-id and I had a question about this part: image of relevant part. Specifically, both parts are minibatch losses. So why is the 1/Nb loss negative, and why would the 1/Ns loss be positive? Intuitively, the losses are summed, so it should be (-1/Nb)+(-1/Ns) = -1/Nb - 1/Ns, right? submitted by /u/asuprem [link] [comments]
    [P] A library for visualizing (CNN) architectures and receptive field analysis
    (1 min) Hi! In my PhD, I studied the design of neural architectures and how to design them without tremendous amounts of trial and error. This lib is my most significant insight compiled into a lightweight package. Simply speaking, I figured out how to predict useless/unproductive layers in CNN architectures. This library allows you to easily visualize neural architectures from PyTorch, with unproductive layers highlighted within the topology. This makes it possible to spot inefficiencies in your CNN architecture reliably, without the need for a single training step! GitHub PyPi Doc submitted by /u/KrakenInAJar [link] [comments]
    [D] Lessons From the Field in Building Your MLOps Strategy with Harpreet Sahota, Data Scientist at Comet at Enterprise Data & AI - Jan 27 @ 12 PM ET
    (1 min) Hi r/MachineLearning! I wanted to share a webinar coming up in January 2022 at Enterprise Data & AI. I put the info from the website below along with the link if you're interested. I'm really interested in the real-life case studies they mention with Uber. --------- Featured Speaker: Harpreet Sahota, Data Scientist at Comet and host of "The Artists of Data Science" Podcast. As machine learning expands and larger organizations begin deploying across bigger teams, the need to efficiently operationalize becomes critical for enterprises. In our discussions with leading organizations utilizing ML, like The RealReal and Uber, we have compiled real-world case studies and organizational best practices for MLOps in the enterprise. Join us for a discussion where we'll explore the benefits of MLOps and discuss when and how to deploy MLOps in your ML workflow. We'll review three real-world case studies that will answer key questions: When to start implementing MLOps? How to start implementing MLOps? How to measure the value of your MLOps strategy? Agenda: 12:00-12:30pm: Featured Presentation; 12:30-1:00pm: Your Q&A and interaction. Link to the website: https://events.cognilytica.com/CLNjE3MHwyNA submitted by /u/DataGeek0 [link] [comments]
    [P] Dissecting and implementing research papers in python
    (1 min) Hi, I wrote a 2 part article on creating interaction networks between characters in novels and other bodies of text. The first part is a detailed literature review outlining and dissecting research papers relevant to the topic and the second part is the implementation of the thoughts and ideas presented in the first part (in python). Check it out if you're interested. Part 1 : https://towardsdatascience.com/mining-modelling-character-networks-part-i-e37e4878c467 Part 2 : https://towardsdatascience.com/mining-modelling-character-networks-part-ii-a3d77de89638 submitted by /u/spidermon97 [link] [comments]
    [P] Library for end-to-end neural search pipelines
    (2 min) Hello everyone! While working on my PhD, I developed a Python package that I am pretty proud of. It allows you to easily create diverse neural search pipelines with retrievers and pre-trained language models as rankers, and it works flawlessly with mid-sized corpora. It is also currently trending #15 on HackerNews! Check it out! Github link Documentation Hackernews link submitted by /u/RaphaelYt [link] [comments]
    [D] What are your favorite tools to visualize/explain tensor operations?
    (1 min) Oftentimes I encounter operations that I do not understand as a whole - which can be fixed by simply using dummy arrays and printing them out. However, that quickly becomes cumbersome. What do y'all use to visualize such complex tensor operations quickly and accurately, to understand what exactly they accomplish? submitted by /u/Competitive-Rub-1958 [link] [comments]
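    One habit that helps even without a dedicated visualization tool is writing the operation with einops, so the axis names document the transformation instead of bare indices. A small sketch:
        import numpy as np
        from einops import rearrange, reduce

        x = np.random.rand(2, 3, 4, 5)             # batch, channels, height, width
        y = rearrange(x, 'b c h w -> b (h w) c')   # flatten spatial dims, channels last
        z = reduce(x, 'b c h w -> b c', 'mean')    # global average pooling
        print(y.shape, z.shape)                    # (2, 20, 3) (2, 3)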
    [D] How to speed up inference of your Transformer-based NLP models?
    (2 min) Hello all, Many of us are having a hard time speeding up our Transformer based NLP models for inference in production. So I thought it would be worth writing an article that summarizes the options one should consider (GPU, batch inference, export to ONNX or Torchscript, using TensorRT, Deepspeed, Triton Inference Server... etc.): https://nlpcloud.io/how-to-speed-up-deep-learning-nlp-transformers-inference.html I hope you'll find it useful. If you can think of additional options, please let me know and I'll add them to the article! Julien submitted by /u/juliensalinas [link] [comments]
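    As a taste of one option from the article, here is a hedged sketch of exporting a Hugging Face classifier to ONNX with dynamic batch and sequence axes (the checkpoint name is only an example):
        import torch
        from transformers import AutoModelForSequenceClassification, AutoTokenizer

        name = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
        model = AutoModelForSequenceClassification.from_pretrained(name)
        model.config.return_dict = False  # tuple outputs trace more cleanly
        model.eval()
        tokenizer = AutoTokenizer.from_pretrained(name)
        inputs = tokenizer("a sample sentence", return_tensors="pt")

        torch.onnx.export(
            model,
            (inputs["input_ids"], inputs["attention_mask"]),
            "model.onnx",
            input_names=["input_ids", "attention_mask"],
            output_names=["logits"],
            dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                          "attention_mask": {0: "batch", 1: "seq"}},
            opset_version=13,
        )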
    [D] You're working against a tight deadline and need to deliver a project quickly. Would you consider speeding up your model training by using only a subset of the available training data?
    (4 min) View Poll submitted by /u/rrpelgrim [link] [comments]
    [D] What would happen if I removed the feed-forward layer in the transformer architecture?
    (1 min) I think it would still work, albeit less efficiently? submitted by /u/Sudden-Lingonberry80 [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    How to choose optimizer and loss— Keras Python neural network
    (1 min) Hi guys, I just programmed my first neural network in Python using Keras, but I'm not sure which loss function and optimizer to use. I used mean squared error, which worked well, but in the past I have only used R², so I'm not sure it's optimal. I am currently using 'Adam' as my optimizer, but I don't really understand what it does. I am doing regression with 28 input neurons leading to 1 output neuron, if that's helpful. Thanks in advance for the help! submitted by /u/Tvdybgggh [link] [comments]
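    For reference, a minimal sketch of how such a model is typically compiled in Keras: MSE is a standard regression loss, and Adam is a solid default optimizer (an adaptive variant of SGD). Note that R² is an evaluation metric rather than a trainable loss; minimizing MSE also drives R² up. Layer sizes here are arbitrary:
        import tensorflow as tf

        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu", input_shape=(28,)),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(1),  # single linear output for regression
        ])
        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
            loss="mse",  # mean squared error
            metrics=[tf.keras.metrics.RootMeanSquaredError()],
        )
        model.summary()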
    Smart Video Generation from Text Using Deep Neural Networks
    (0 min) submitted by /u/BraveOutage [link] [comments]
    Introduction to GNN
    (1 min) Hi all, I would like to learn about Graph Neural Networks. Can you tell me a good starting paper or webpage for a basic understanding? I would also like an example project to understand the functionality. submitted by /u/Shocky698 [link] [comments]
    VAE: CIFAR-10 & PyTorch - loss not improving
    (1 min) I have implemented a Variational Autoencoder using a Conv-6 CNN (VGG-* family) as the encoder and decoder, with CIFAR-10, in PyTorch. You can refer to the full code here. The problem is that the total loss (= reconstruction loss + KL-divergence loss) doesn't improve. Also, the log-variance is almost 0, further indicating that the mapping to multivariate Gaussians in the latent space is not happening as expected, since the log-variance should take values between, say, -4 and +3. You can see this in this code, where the log-variance is changing and has non-zero values. Suggestions to alleviate the situation? submitted by /u/grid_world [link] [comments]
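    For reference, the standard VAE objective that the quoted losses correspond to; note that if mu and logvar both collapse toward 0, the KL term vanishes and the latent space degenerates, which matches the symptom described. A generic sketch, not the poster's code:
        import torch
        import torch.nn.functional as F

        def vae_loss(x, x_recon, mu, logvar, beta=1.0):
            # Reconstruction term: how well the decoder reproduces the input
            recon = F.mse_loss(x_recon, x, reduction="sum")
            # KL term: pushes q(z|x) = N(mu, exp(logvar)) toward the N(0, I) prior
            kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
            return recon + beta * kl  # try beta < 1 if the KL term dominates early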
  • Reinforcement Learning

    Multiple Action Spaces
    (2 min) Hey guys! I'm wondering if we can apply Reinforcement Learning where an agent has two action spaces, and at time t the agent has to select one of the action spaces (A or B). If it selects A, there are n tasks and the agent must select one of them. The same goes for action space B: there are m choices and the agent must select among them. It's a complicated problem. All I want to know is whether this is possible. submitted by /u/LeatherCredit7148 [link] [comments]
    Q-learning with short-term vs long-term rewards
    (2 min) Hey guys, I have implemented and applied the Q-learning algorithm to a simple grid environment. I defined terminating states, one with a positive reward and others with negative rewards, and the training worked pretty well. Now, I wanted to extend the setup by adding a state with half the reward, i.e., there are now 3 types of terminating states: a state with the biggest positive reward (100), a state with half that reward (50), and states with negative rewards (-100). As I said, they all terminate the episode. However, when I test the trained agent, it sometimes goes to the state with half the reward, so the process is not as efficient as it was before. Could someone give me any tips on how to approach this problem? Is Q-learning a good approach for short-term vs long-term rewards? Thank you in advance! submitted by /u/studentani [link] [comments]
    Can you list all the US universities that you know that do research in reinforcement learning? (other than Stanford/MIT/Berkeley/UW/CMU)
    (1 min) Thank you. submitted by /u/Ok-Reaction-1515 [link] [comments]
    Question: Why does my robot arm curls up?
    (1 min) Hi, I have an issue with my algorithm and I can't seem to find where it lies. I'm using RLBench for the environment, with Option-Critic and DDPG as the agent, and I'm trying the reach-target task. I'm not sure why, but when I looked at the results, the arm ends up curling/trying to return to its initial position when it tries to solve the task at hand. This happened for the open-box task as well, where it tries to reach the box lid to open it before it swerves and curls up at the robot's base. Does anyone have any insight as to what is causing this issue? https://reddit.com/link/s09668/video/7xwaris1zra81/player https://reddit.com/link/s09668/video/44dauc99yra81/player submitted by /u/Temptedtosleep [link] [comments]
    How different is the code in Unity and Isaac Sim?
    (1 min) I am testing RL algorithms in Unity but am thinking of changing to Isaac Gym. How different is the code? Do I have to change a lot? submitted by /u/baegyutae [link] [comments]

2022-01-09

  • John D. Cook

    Beatty’s theorem
    (3 min) Here’s a surprising theorem [1]. (Beatty’s theorem) Let a and b be any pair of irrational numbers greater than 1 with 1/a + 1/b = 1. Then the sequences { ⌊na⌋ } and { ⌊nb⌋ } contain every positive integer without repetition. Illustration: Here’s a little Python code to play with this theorem. We set a […] Beatty’s theorem first appeared on John D. Cook.
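    Since the code excerpt is truncated in this digest, here is a small stand-in sketch that checks the theorem numerically for a = √2 (so b = a/(a-1), which satisfies 1/a + 1/b = 1):
        from math import floor, sqrt

        a = sqrt(2)
        b = a / (a - 1)  # guarantees 1/a + 1/b = 1
        A = {floor(n * a) for n in range(1, 1000)}
        B = {floor(n * b) for n in range(1, 1000)}

        print(A & B)                          # set(): no integer is in both sequences
        print(set(range(1, 500)) - (A | B))   # set(): none below 500 is missed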
    Corner quotes in Unicode
    (2 min) In his book Mastering Regular Expressions, Jeffrey Friedl uses corner quotes to delimit regular expressions. Here’s an example I found by opening his book at random: ⌜(\.\d\d[1-9]?)\d*⌟ The upper-left corner at the beginning and the lower-right corner at the end are not part of the regular expression. This particularly comes in handy if a regular […] Corner quotes in Unicode first appeared on John D. Cook.
  • Artificial Intelligence

    Topics for Debates on AI
    (1 min) Hello! I'm a high school computer science teacher, and teach a course on computer ethics. One of my units is on A.I. and I want to conclude the unit with student debates on topics in AI. I'm struggling to come up with topic statements however. I know for sure I want one of the topics to be centered on whether A.I. at an advanced level should be afforded the same rights as humans. Any other topic statement ideas? Thanks! submitted by /u/CT_History_Teacher [link] [comments]
    total noob here, looking for a program to create AI art
    (1 min) Hey there! I am a big fan of AI-generated images and would really love to start messing with my "own" AI-generated art, but it looks like things like OpenAI have a pretty rough barrier to entry. I am fairly knowledgeable with computers, but I don't have any real programming/coding skills. Can anyone recommend a program I can use to both train and generate AI art, or point me in the right direction? I really want something that does more than the browser apps, something that I can hopefully train (is that the right word?) myself on specific images. submitted by /u/Chickenwomp [link] [comments]
    Apple ML Researchers Introduce ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data
    (1 min) Understanding indoor 3D scenes is becoming increasingly important in augmented reality, robotics, photography, games, and real estate. Many state-of-the-art scene interpretation algorithms have lately been driven by modern machine learning approaches. Depth estimation, 3D reconstruction, instance segmentation, object detection, and other methods are used to address distinct aspects of the problem. The majority of these studies are made possible by a range of real and synthetic RGB-D datasets that have been made available in recent years. Even though commercially accessible RGB-D sensors, such as Microsoft Kinect, have made the collection of such datasets possible, capturing data at a significant scale with ground truth is still a problematic issue. Continue Reading Paper: https://arxiv.org/pdf/2111.08897.pdf Github: https://github.com/apple/ARKitScenes submitted by /u/ai-lover [link] [comments]
    [E-Book] INSIDE ISAAC ASIMOV: QUOTES & CONTEMPLATIONS
    (1 min) Isaac Asimov published more than 500 books (the Foundation series, I, Robot, ...) during his lifetime. Asimov is best known as one of the grandmasters of science fiction. He also wrote textbooks and scientific studies that inspire many scientists and AI researchers. As a member of Mensa International, his genius was recognized worldwide. "The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom." ~ Isaac Asimov The book contains some of his most inspiring quotes and thoughts. Be Inspired! INSIDE ISAAC ASIMOV: QUOTES & CONTEMPLATIONS (Amazon) submitted by /u/Philo167 [link] [comments]
    Brain Efficiency: Much More than You Wanted to Know
    (0 min) https://www.lesswrong.com/posts/xwBuoE9p8GE7RAuhd/brain-efficiency-much-more-than-you-wanted-to-know submitted by /u/Singularian2501 [link] [comments]
    Is there an AI similar to this (editing images to make black sketches out of them)?
    (0 min) https://youtu.be/I4omT2L9aI8?t=759 https://www.youtube.com/watch?v=Peccbcj8Ibs&t=142s https://drawingbotv3.readthedocs.io/en/latest/quickstart.html I want to plot selfies of my girlfriend with a pen plotter. The images should look like they were drawn. submitted by /u/xXLisa28Xx [link] [comments]
    Training a zombie via reinforcement learning for a video game
    (1 min) submitted by /u/floridianfisher [link] [comments]
  • Machine Learning

    [R] RuDOLPH: One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP
    (0 min) submitted by /u/Illustrious_Row_9971 [link] [comments]
    [D] Are there any ML groups in Boston?
    (1 min) Hi, I am a graduate student and I recently moved to Boston. I was wondering if there are any ML/DL/RL or related groups in Boston that host meetups in and around the city. Thank you. ​ P.S: If you are from Boston and interested in any of these fields, then I would love to connect with you. Please send a DM or text me in reddit chat :) submitted by /u/frankhart98 [link] [comments]
    [R] Sensing Depth with 3D Computer Vision - Link to a free online lecture by the author in comments
    (2 min) submitted by /u/pinter69 [link] [comments]
    [D] Does anyone else think open source code/examples in the machine learning domain are usually not as readable as they could be? Specifically, the use of magic numbers.
    (3 min) Admittedly, I am not an expert in machine learning or the different libraries, but the example code I see is not really beginner friendly. Even experts, I am not sure, know all the libraries and quirks of different datasets. Let me elaborate. The main problem I see is the use of magic numbers. For example, in the hypothetical code below
        x = dataset[1]
    there is no indication of why 1 is used instead of 0, or what it means. Maybe the 0th element contains metadata or some useless data. In other cases, some axis is chosen without specifying why that one is used and what the other axes are, to put it in context. My only suggestion would be to never use a magic number unless it is immediately obvious. Can we not use an appropriately named constant in that case?
        MY_DATA_INDEX = 1
        x = dataset[MY_DATA_INDEX]
    I believe this is a very simple and helpful convention to follow. If such conventions already exist, can someone point me to them? Maybe people just aren't using them often enough. submitted by /u/junovac [link] [comments]
    [D] The Future of Machine Learning and why it looks a lot like Julia 🤖
    (4 min) I published an article on Towards Data Science a few weeks ago about why Julia is the future of ML: https://towardsdatascience.com/the-future-of-machine-learning-and-why-it-looks-a-lot-like-julia-a0e26b51f6a6 ​ Following up on this post, I will be hosting a Twitter Spaces tomorrow (Monday). If you are interested in joining the discussion, you can find out more in this tweet: https://twitter.com/JuliaLanguage/status/1479899084261572609?s=20 submitted by /u/LoganKilpatrick1 [link] [comments]
    [D] Best tools for Multi-GPU model training?
    (2 min) Hi everyone, until recently I only had to work on problems for which a single-GPU training setup would suffice. But I am now working on a problem with a large dataset, and I have access to multiple GPUs, so I was wondering what's the best way to set this up. Going through the PyTorch documentation, they seem to suggest using torch.distributed.run along with DistributedDataParallel for distributed training, although I also came across libraries in which the boilerplate is abstracted away, such as PyTorch Lightning or DeepSpeed. Since I'm new to this, I just wanted an opinion on the route I should choose for training my Transformer models in a distributed way. Thanks in advance! submitted by /u/Areyy_Yaar [link] [comments]
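    For the raw PyTorch route, the moving parts are fairly small. A hedged sketch of the per-process setup (launched with torchrun --nproc_per_node=NUM_GPUS train.py; the model and loop are stand-ins):
        import os
        import torch
        import torch.distributed as dist
        from torch.nn.parallel import DistributedDataParallel as DDP

        def main():
            dist.init_process_group("nccl")  # torchrun supplies the rendezvous env vars
            local_rank = int(os.environ["LOCAL_RANK"])
            torch.cuda.set_device(local_rank)

            model = torch.nn.Linear(512, 10).cuda(local_rank)  # stand-in model
            model = DDP(model, device_ids=[local_rank])
            opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

            for _ in range(10):  # stand-in training loop
                x = torch.randn(32, 512).cuda(local_rank)
                loss = model(x).sum()
                opt.zero_grad()
                loss.backward()  # DDP all-reduces gradients across processes here
                opt.step()

            dist.destroy_process_group()

        if __name__ == "__main__":
            main()
    In practice you also need a DistributedSampler so each process sees a different shard of the dataset; PyTorch Lightning and DeepSpeed wrap exactly this boilerplate.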
    [D] Looking for open source projects to contribute to
    (3 min) Hi all, for the last 6 months I have been immersed in the deep learning problem domain, spending a lot of time catching up on the literature, taking a few courses, and doing some personal projects as well. But now I'm at the point where I'd like to start contributing in a more meaningful way to the community. Does anyone have ideas for good open source projects related to DL (or even classic machine learning) that are looking for contributors? Thanks for any suggestions! submitted by /u/wreemde [link] [comments]
  • Reinforcement Learning

    What is the role of target actor and target critic?
    (2 min) Could you explain the role of the target actor and target critic in DRL? Thanks submitted by /u/No_Possibility_7588 [link] [comments]
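    In short, the target actor and target critic are lagged copies of the online networks used only to compute the TD target, so the bootstrapped target does not chase the very network being updated, which would destabilize training. They are typically updated by Polyak averaging; a minimal sketch:
        import torch

        def soft_update(target_net, online_net, tau=0.005):
            # Target slowly tracks the online net: theta' <- tau*theta + (1-tau)*theta'
            with torch.no_grad():
                for tp, p in zip(target_net.parameters(), online_net.parameters()):
                    tp.mul_(1.0 - tau).add_(tau * p)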
    Smoother movements for robot-based RL
    (2 min) Hello there, I am working on an RL environment where an agent moves around in a certain area. At each time step, the agent needs to choose a direction of movement (out of 8 choices: up, down, right, left, and the diagonals). I do get the performance I am expecting in terms of reward/task completion, but the agent's movement is just not that smooth (see image below). What are my options here? I tried a continuous action space (2 actions representing forces on the x and y axes), but that did not work really well. https://preview.redd.it/7ihp4t7kila81.png?width=444&format=png&auto=webp&s=585cd280e7cb1c2cf30ab7fe1cf22565eb2409b7 submitted by /u/AhmedNizam_ [link] [comments]
    representing multiple actions using gym multi-discrete
    (1 min) Hello everyone, I am trying to implement an RL agent using PPO with an actor/critic. The agent has to move in an x-y plane by setting 2 discretized forces (2 actions), one along its x axis and one along its y axis. Initially I thought I needed two output heads for my actor network, one for each action. However, I came across this work by OpenAI, where they have a similar agent. They instead use one output head for the movement action (along x, y, and z), where the action has a "MultiDiscrete" type. Any idea how this works? I have tried to understand it from the gym code, but I don't get what MultiDiscrete does. Is it a way of encoding all the combinations of the different actions? Any help/insight is really appreciated. submitted by /u/AhmedNizam_ [link] [comments]
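    Roughly, MultiDiscrete is a vector of independent discrete sub-actions rather than one flat choice over all combinations; the policy head then outputs one categorical distribution per dimension, and the joint log-probability is the sum of the per-dimension log-probabilities. A small sketch with gym's spaces:
        from gym import spaces

        # Two discretized forces, 5 levels each: one sub-action per axis
        action_space = spaces.MultiDiscrete([5, 5])
        print(action_space.sample())  # e.g. [2 4]: x-force level 2, y-force level 4
        # A flat Discrete encoding of the same choices would need 5 * 5 = 25 actions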

2022-01-08

  • cs.LG updates on arXiv.org

    Sparsity-based Feature Selection for Anomalous Subgroup Discovery. (arXiv:2201.02008v1 [cs.LG])
    (2 min) Anomalous pattern detection aims to identify instances where deviation from normalcy is evident, and is widely applicable across domains. Multiple anomaly detection techniques have been proposed in the state of the art. However, there is a common lack of a principled and scalable feature selection method for efficient discovery. Existing feature selection techniques often optimize the performance of prediction outcomes rather than systemic deviations from the expected. In this paper, we propose a sparsity-based automated feature selection (SAFS) framework, which encodes systemic outcome deviations via the sparsity of feature-driven odds ratios. SAFS is a model-agnostic approach usable across different discovery techniques. SAFS achieves more than a $3\times$ reduction in computation time while maintaining detection performance when validated on a publicly available critical care dataset. SAFS also results in superior performance when compared against multiple baselines for feature selection.
    ExSinGAN: Learning an Explainable Generative Model from a Single Image. (arXiv:2105.07350v2 [cs.CV] UPDATED)
    (2 min) Generating images from a single sample, as a newly developing branch of image synthesis, has attracted extensive attention. In this paper, we formulate this problem as sampling from the conditional distribution of a single image, and propose a hierarchical framework that simplifies the learning of the intricate conditional distributions through the successive learning of the distributions about structure, semantics and texture, making the process of learning and generation comprehensible. On this basis, we design ExSinGAN composed of three cascaded GANs for learning an explainable generative model from a given image, where the cascaded GANs model the distributions about structure, semantics and texture successively. ExSinGAN is learned not only from the internal patches of the given image as the previous works did, but also from the external prior obtained by the GAN inversion technique. Benefiting from the appropriate combination of internal and external information, ExSinGAN has a more powerful capability of generation and competitive generalization ability for the image manipulation tasks compared with prior works.
    Formal Analysis of Art: Proxy Learning of Visual Concepts from Style Through Language Models. (arXiv:2201.01819v1 [cs.LG])
    (2 min) We present a machine learning system that can quantify fine art paintings with a set of visual elements and principles of art. This formal analysis is fundamental for understanding art, but developing such a system is challenging. Paintings have high visual complexity, and it is also difficult to collect enough training data with direct labels. To resolve these practical limitations, we introduce a novel mechanism, called proxy learning, which learns visual concepts in paintings through their general relation to styles. This framework does not require any visual annotation, but only uses style labels and a general relationship between visual concepts and style. In this paper, we propose a novel proxy model and reformulate four pre-existing methods in the context of proxy learning. Through quantitative and qualitative comparison, we evaluate these methods and compare their effectiveness in quantifying the artistic visual concepts, where the general relationship is estimated by language models: GloVe or BERT. The language modeling is a practical and scalable solution requiring no labeling, but it is inevitably imperfect. We demonstrate how the new proxy model is robust to the imperfection, while the other models are sensitively affected by it.
    Balancing Generalization and Specialization in Zero-shot Learning. (arXiv:2201.01961v1 [cs.CV])
    (2 min) Zero-Shot Learning (ZSL) aims to transfer classification capability from seen to unseen classes. Recent methods have proved that generalization and specialization are two essential abilities to achieve good performance in ZSL. However, they all focus on only one of the abilities, resulting in models that are either too general with the degraded classifying ability or too specialized to generalize to unseen classes. In this paper, we propose an end-to-end network with balanced generalization and specialization abilities, termed as BGSNet, to take advantage of both abilities, and balance them at instance- and dataset-level. Specifically, BGSNet consists of two branches: the Generalization Network (GNet), which applies episodic meta-learning to learn generalized knowledge, and the Balanced Specialization Network (BSNet), which adopts multiple attentive extractors to extract discriminative features and fulfill the instance-level balance. A novel self-adjusting diversity loss is designed to optimize BSNet with less redundancy and more diversity. We further propose a differentiable dataset-level balance and update the weights in a linear annealing schedule to simulate network pruning and thus obtain the optimal structure for BSNet at a low cost with dataset-level balance achieved. Experiments on four benchmark datasets demonstrate our model's effectiveness. Sufficient component ablations prove the necessity of integrating generalization and specialization abilities.
    Confidential Machine Learning Computation in Untrusted Environments: A Systems Security Perspective. (arXiv:2111.03308v3 [cs.CR] UPDATED)
    (2 min) As machine learning (ML) technologies and applications are rapidly changing many computing domains, security issues associated with ML are also emerging. In the domain of systems security, many endeavors have been made to ensure ML model and data confidentiality. ML computations are often inevitably performed in untrusted environments and entail complex multi-party security requirements. Hence, researchers have leveraged the Trusted Execution Environments (TEEs) to build confidential ML computation systems. We conduct a systematic and comprehensive survey by classifying attack vectors and mitigation in confidential ML computation in untrusted environments, analyzing the complex security requirements in multi-party scenarios, and summarizing engineering challenges in confidential ML implementation. Lastly, we suggest future research directions based on our study.
    Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets. (arXiv:2003.03685v2 [stat.ME] UPDATED)
    (2 min) The paper introduces a novel conditional independence (CI) based method for linear and nonlinear, lagged and contemporaneous causal discovery from observational time series in the causally sufficient case. Existing CI-based methods such as the PC algorithm and also common methods from other frameworks suffer from low recall and partially inflated false positives for strong autocorrelation, which is a ubiquitous challenge in time series. The novel method, PCMCI$^+$, extends PCMCI [Runge et al., 2019b] to include discovery of contemporaneous links. PCMCI$^+$ improves the reliability of CI tests by optimizing the choice of conditioning sets and even benefits from autocorrelation. The method is order-independent and consistent in the oracle case. A broad range of numerical experiments demonstrates that PCMCI$^+$ has higher adjacency detection power and especially more contemporaneous orientation recall compared to other methods while better controlling false positives. Optimized conditioning sets also lead to much shorter runtimes than the PC algorithm. PCMCI$^+$ can be of considerable use in many real world application scenarios where often time resolutions are too coarse to resolve time delays and strong autocorrelation is present.
    A probabilistic model for fast-to-evaluate 2D crack path prediction in heterogeneous materials. (arXiv:2112.13578v2 [cs.LG] UPDATED)
    (2 min) This paper is devoted to the construction of a new fast-to-evaluate model for the prediction of 2D crack paths in concrete-like microstructures. The model generates piecewise linear crack paths with segmentation points selected using a Markov chain model. The Markov chain kernel involves local indicators of mechanical interest, and its parameters are learnt from numerical full-field 2D simulations of cracking using a cohesive-volumetric finite element solver called XPER. The resulting model exhibits a drastic improvement in CPU time in comparison to simulations from XPER.
    Classification of Long Sequential Data using Circular Dilated Convolutional Neural Networks. (arXiv:2201.02143v1 [cs.LG])
    (2 min) Classification of long sequential data is an important Machine Learning task and appears in many application scenarios. Recurrent Neural Networks, Transformers, and Convolutional Neural Networks are three major techniques for learning from sequential data. Among these methods, Temporal Convolutional Networks (TCNs) which are scalable to very long sequences have achieved remarkable progress in time series regression. However, the performance of TCNs for sequence classification is not satisfactory because they use a skewed connection protocol and output classes at the last position. Such asymmetry restricts their performance for classification which depends on the whole sequence. In this work, we propose a symmetric multi-scale architecture called Circular Dilated Convolutional Neural Network (CDIL-CNN), where every position has an equal chance to receive information from other positions at the previous layers. Our model gives classification logits in all positions, and we can apply a simple ensemble learning to achieve a better decision. We have tested CDIL-CNN on various long sequential datasets. The experimental results show that our method has superior performance over many state-of-the-art approaches.
    Improving Spectral Clustering Using Spectrum-Preserving Node Reduction. (arXiv:2110.12328v2 [cs.LG] UPDATED)
    (2 min) Spectral clustering is one of the most popular clustering methods. However, the high computational cost due to the involved eigen-decomposition procedure can immediately hinder its applications in large-scale tasks. In this paper we use spectrum-preserving node reduction to accelerate eigen-decomposition and generate concise representations of data sets. Specifically, we create a small number of pseudo-nodes based on spectral similarity. Then, the standard spectral clustering algorithm is performed on the smaller node set. Finally, each data point in the original data set is assigned to the same cluster as its representative pseudo-node. The proposed framework runs in nearly-linear time. Meanwhile, the clustering accuracy can be significantly improved by mining concise representations. The experimental results show dramatically improved clustering performance when compared with state-of-the-art methods.
    Jointly Efficient and Optimal Algorithms for Logistic Bandits. (arXiv:2201.01985v1 [cs.LG])
    (2 min) Logistic Bandits have recently undergone careful scrutiny by virtue of their combined theoretical and practical relevance. This research effort delivered statistically efficient algorithms, improving the regret of previous strategies by exponentially large factors. Such algorithms are however strikingly costly as they require $\Omega(t)$ operations at each round. On the other hand, a different line of research focused on computational efficiency ($\mathcal{O}(1)$ per-round cost), but at the cost of letting go of the aforementioned exponential improvements. Obtaining the best of both worlds is unfortunately not a matter of marrying both approaches. Instead we introduce a new learning procedure for Logistic Bandits. It yields confidence sets whose sufficient statistics can be easily maintained online without sacrificing statistical tightness. Combined with efficient planning mechanisms we design fast algorithms whose regret performance still matches the problem-dependent lower-bound of Abeille et al. (2021). To the best of our knowledge, these are the first Logistic Bandit algorithms that simultaneously enjoy statistical and computational efficiency.
    CLLD: Contrastive Learning with Label Distance for Text Classification. (arXiv:2110.13656v3 [cs.LG] UPDATED)
    (2 min) Existing pre-trained models have achieved state-of-the-art performance on various text classification tasks. These models have proven to be useful in learning universal language representations. However, the semantic discrepancy between similar texts cannot be effectively distinguished by advanced pre-trained models, which has a great influence on the performance of hard-to-distinguish classes. To address this problem, we propose a novel Contrastive Learning with Label Distance (CLLD) in this work. Inspired by recent advances in contrastive learning, we specifically design a classification method with label distance for learning contrastive classes. CLLD ensures the flexibility within the subtle differences that lead to different label assignments, and generates distinct representations for each class having similarity simultaneously. Extensive experiments on public benchmarks and internal datasets demonstrate that our method improves the performance of pre-trained models on classification tasks. Importantly, our experiments suggest that the learned label distance relieves the adversarial nature between classes.
    Efficient Global Optimization of Two-layer ReLU Networks: Quadratic-time Algorithms and Adversarial Training. (arXiv:2201.01965v1 [cs.LG])
    (2 min) The non-convexity of the artificial neural network (ANN) training landscape brings inherent optimization difficulties. While the traditional back-propagation stochastic gradient descent (SGD) algorithm and its variants are effective in certain cases, they can become stuck at spurious local minima and are sensitive to initializations and hyperparameters. Recent work has shown that the training of an ANN with ReLU activations can be reformulated as a convex program, bringing hope to globally optimizing interpretable ANNs. However, naively solving the convex training formulation has an exponential complexity, and even an approximation heuristic requires cubic time. In this work, we characterize the quality of this approximation and develop two efficient algorithms that train ANNs with global convergence guarantees. The first algorithm is based on the alternating direction method of multipliers (ADMM). It solves both the exact convex formulation and the approximate counterpart. Linear global convergence is achieved, and the initial several iterations often yield a solution with high prediction accuracy. When solving the approximate formulation, the per-iteration time complexity is quadratic. The second algorithm, based on the "sampled convex programs" theory, is simpler to implement. It solves unconstrained convex formulations and converges to an approximately globally optimal classifier. The non-convexity of the ANN training landscape exacerbates when adversarial training is considered. We apply the robust convex optimization theory to convex training and develop convex formulations that train ANNs robust to adversarial inputs. Our analysis explicitly focuses on one-hidden-layer fully connected ANNs, but can extend to more sophisticated architectures.
    Towards a theory of out-of-distribution learning. (arXiv:2109.14501v4 [stat.ML] UPDATED)
    (2 min) What is learning? 20$^{th}$ century formalizations of learning theory -- which precipitated revolutions in artificial intelligence -- focus primarily on $\mathit{in-distribution}$ learning, that is, learning under the assumption that the training data are sampled from the same distribution as the evaluation distribution. This assumption renders these theories inadequate for characterizing 21$^{st}$ century real world data problems, which are typically characterized by evaluation distributions that differ from the training data distributions (referred to as out-of-distribution learning). We therefore make a small change to existing formal definitions of learnability by relaxing that assumption. We then introduce $\mathbf{learning\ efficiency}$ (LE) to quantify the amount a learner is able to leverage data for a given problem, regardless of whether it is an in- or out-of-distribution problem. We then define and prove the relationship between generalized notions of learnability, and show how this framework is sufficiently general to characterize transfer, multitask, meta, continual, and lifelong learning. We hope this unification helps bridge the gap between empirical practice and theoretical guidance in real world problems. Finally, because biological learning continues to outperform machine learning algorithms on certain OOD challenges, we discuss the limitations of this framework vis-\'a-vis its ability to formalize biological learning, suggesting multiple avenues for future research.
    Randomized Spectral Clustering in Large-Scale Stochastic Block Models. (arXiv:2002.00839v3 [cs.SI] UPDATED)
    (2 min) Spectral clustering has been one of the widely used methods for community detection in networks. However, large-scale networks bring computational challenges to the eigenvalue decomposition therein. In this paper, we study spectral clustering using randomized sketching algorithms from a statistical perspective, where we typically assume the network data are generated from a stochastic block model that is not necessarily of full rank. To do this, we first use recently developed sketching algorithms to obtain two randomized spectral clustering algorithms, namely, the random projection-based and the random sampling-based spectral clustering. Then we study the theoretical bounds of the resulting algorithms in terms of the approximation error for the population adjacency matrix, the misclassification error, and the estimation error for the link probability matrix. It turns out that, under mild conditions, the randomized spectral clustering algorithms lead to the same theoretical bounds as those of the original spectral clustering algorithm. We also extend the results to degree-corrected stochastic block models. Numerical experiments support our theoretical findings and show the efficiency of randomized methods. A new R package called Rclust is developed and made available to the public.
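    A minimal numpy sketch of the random-projection idea above, assuming a dense symmetric adjacency matrix; the function name, oversampling parameter, and the use of scikit-learn's KMeans are illustrative choices of ours, not details of the paper or the Rclust package.

        import numpy as np
        from sklearn.cluster import KMeans

        def randomized_spectral_clustering(A, k, oversample=10, seed=0):
            # Compress the n x n adjacency matrix with a Gaussian sketch,
            # then eigendecompose the small projected matrix (Rayleigh-Ritz).
            rng = np.random.default_rng(seed)
            Omega = rng.standard_normal((A.shape[0], k + oversample))
            Q, _ = np.linalg.qr(A @ Omega)             # approximate range basis
            vals, vecs = np.linalg.eigh(Q.T @ A @ Q)
            lead = np.argsort(-np.abs(vals))[:k]
            U = Q @ vecs[:, lead]                      # approximate eigenvectors
            # Cluster the rows of the spectral embedding as usual.
            return KMeans(n_clusters=k, n_init=10).fit_predict(U)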
    A note on efficient minimum cost adjustment sets in causal graphical models. (arXiv:2201.02037v1 [math.ST])
    (2 min) We study the selection of adjustment sets for estimating the interventional mean under an individualized treatment rule. We assume a non-parametric causal graphical model with, possibly, hidden variables and at least one adjustment set comprised of observable variables. Moreover, we assume that observable variables have positive costs associated with them. We define the cost of an observable adjustment set as the sum of the costs of the variables that comprise it. We show that in this setting there exist adjustment sets that are minimum cost optimal, in the sense that they yield non-parametric estimators of the interventional mean with the smallest asymptotic variance among those that control for observable adjustment sets that have minimum cost. Our results are based on the construction of a special flow network associated with the original causal graph. We show that a minimum cost optimal adjustment set can be found by computing a maximum flow on the network, and then finding the set of vertices that are reachable from the source by augmenting paths. The optimaladj Python package implements the algorithms introduced in this paper.
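    The flow computation at the heart of this result can be illustrated with networkx: after a maximum flow is found, the source side of the minimum cut is exactly the set of vertices still reachable from the source by augmenting paths. The toy network below is hypothetical; in the paper the network is constructed from the causal graph, with capacities encoding variable costs.

        import networkx as nx

        G = nx.DiGraph()
        G.add_edge("s", "x1", capacity=2.0)   # capacities stand in for costs
        G.add_edge("s", "x2", capacity=1.0)
        G.add_edge("x1", "t", capacity=3.0)
        G.add_edge("x2", "t", capacity=1.0)

        # minimum_cut returns the cut value and the partition whose first set
        # contains the source, i.e. the vertices reachable from "s" in the
        # residual graph after a maximum flow.
        cut_value, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
        print(cut_value, source_side)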
    AUGCO: Augmentation Consistency-guided Self-training for Source-free Domain Adaptive Semantic Segmentation. (arXiv:2107.10140v2 [cs.CV] UPDATED)
    (2 min) Most modern approaches for domain adaptive semantic segmentation rely on continued access to source data during adaptation, which may be infeasible due to computational or privacy constraints. We focus on source-free domain adaptation for semantic segmentation, wherein a source model must adapt itself to a new target domain given only unlabeled target data. We propose Augmentation Consistency-guided Self-training (AUGCO), a source-free adaptation algorithm that uses the model's pixel-level predictive consistency across diverse, automatically generated views of each target image along with model confidence to identify reliable pixel predictions, and selectively self-trains on those. AUGCO achieves state-of-the-art results for source-free adaptation on 3 standard benchmarks for semantic segmentation, all within a method that is simple to implement and fast to converge.
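    A hedged PyTorch sketch of the selection rule described above, assuming the augmented views have already been mapped back to a common pixel grid; the function name and threshold are illustrative, and the actual AUGCO criterion may differ in detail.

        import torch
        import torch.nn.functional as F

        def reliable_pixel_targets(model, views, conf_thresh=0.9):
            # views: list of (B, C, H, W) tensors, geometrically aligned.
            with torch.no_grad():
                probs = torch.stack([F.softmax(model(v), dim=1) for v in views])
                conf, labels = probs.mean(dim=0).max(dim=1)   # per-pixel label
                per_view = probs.argmax(dim=2)                # (V, B, H, W)
                consistent = (per_view == labels.unsqueeze(0)).all(dim=0)
                mask = consistent & (conf > conf_thresh)
            # Self-train with cross-entropy only where mask is True.
            return labels, mask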
    Towards control of opinion diversity by introducing zealots into a polarised social group. (arXiv:2006.07265v7 [cs.SI] UPDATED)
    (2 min) We explore a method to influence or even control the diversity of opinions within a polarised social group. We leverage the voter model in which users hold binary opinions and repeatedly update their beliefs based on others they connect with. Stubborn agents who never change their minds ("zealots") are also disseminated through the network, which is modelled by a connected graph. Building on earlier results, we provide a closed-form expression for the average opinion of the group at equilibrium. This leads us to a strategy to inject zealots into a polarised network in order to shift the average opinion towards any target value. We account for the possible presence of a backfire effect, which may lead the group to react negatively and reinforce its level of polarisation in response. Our results are supported by numerical experiments on synthetic data.
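    The update rule is easy to simulate. Below is a toy numpy/networkx sketch, with an illustrative graph and zealot placement of our own choosing: zealots keep their opinion fixed while ordinary users repeatedly copy a random neighbour, and the average opinion settles near a value determined by the zealots.

        import numpy as np
        import networkx as nx

        def voter_with_zealots(G, zealots, n_steps=20000, seed=0):
            rng = np.random.default_rng(seed)
            opinion = {v: int(rng.integers(0, 2)) for v in G}
            opinion.update(zealots)                 # pin zealot opinions
            nodes = list(G)
            for _ in range(n_steps):
                v = nodes[rng.integers(len(nodes))]
                if v in zealots or not G[v]:
                    continue                        # zealots never update
                nbrs = list(G[v])
                opinion[v] = opinion[nbrs[rng.integers(len(nbrs))]]
            return np.mean(list(opinion.values()))  # average group opinion

        G = nx.erdos_renyi_graph(200, 0.05, seed=1)
        print(voter_with_zealots(G, {0: 1, 1: 1, 2: 0}))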
    The dynamics of representation learning in shallow, non-linear autoencoders. (arXiv:2201.02115v1 [stat.ML])
    (2 min) Autoencoders are the simplest neural network for unsupervised learning, and thus an ideal framework for studying feature learning. While a detailed understanding of the dynamics of linear autoencoders has recently been obtained, the study of non-linear autoencoders has been hindered by the technical difficulty of handling training data with non-trivial correlations - a fundamental prerequisite for feature extraction. Here, we study the dynamics of feature learning in non-linear, shallow autoencoders. We derive a set of asymptotically exact equations that describe the generalisation dynamics of autoencoders trained with stochastic gradient descent (SGD) in the limit of high-dimensional inputs. These equations reveal that autoencoders learn the leading principal components of their inputs sequentially. An analysis of the long-time dynamics explains the failure of sigmoidal autoencoders to learn with tied weights, and highlights the importance of training the bias in ReLU autoencoders. Building on previous results for linear networks, we analyse a modification of the vanilla SGD algorithm which allows learning of the exact principal components. Finally, we show that our equations accurately describe the generalisation dynamics of non-linear autoencoders on realistic datasets such as CIFAR10.
    Revisiting Deep Subspace Alignment for Unsupervised Domain Adaptation. (arXiv:2201.01806v1 [cs.LG])
    (2 min) Unsupervised domain adaptation (UDA) aims to transfer and adapt knowledge from a labeled source domain to an unlabeled target domain. Traditionally, subspace-based methods form an important class of solutions to this problem. Despite their mathematical elegance and tractability, these methods are often found to be ineffective at producing domain-invariant features with complex, real-world datasets. Motivated by the recent advances in representation learning with deep networks, this paper revisits the use of subspace alignment for UDA and proposes a novel adaptation algorithm that consistently leads to improved generalization. In contrast to existing adversarial training-based DA methods, our approach isolates feature learning and distribution alignment steps, and utilizes a primary-auxiliary optimization strategy to effectively balance the objectives of domain invariance and model fidelity. While providing a significant reduction in target data and computational requirements, our subspace-based DA performs competitively and sometimes even outperforms state-of-the-art approaches on several standard UDA benchmarks. Furthermore, subspace alignment leads to intrinsically well-regularized models that demonstrate strong generalization even in the challenging partial DA setting. Finally, the design of our UDA framework inherently supports progressive adaptation to new target domains at test-time, without requiring retraining of the model from scratch. In summary, powered by strong feature learners and an effective optimization strategy, we establish subspace-based DA as a highly effective approach for visual recognition.
    Gaussian Imagination in Bandit Learning. (arXiv:2201.01902v1 [cs.LG])
    (2 min) Assuming distributions are Gaussian often facilitates computations that are otherwise intractable. We consider an agent who is designed to attain a low information ratio with respect to a bandit environment with a Gaussian prior distribution and a Gaussian likelihood function, but study the agent's performance when applied instead to a Bernoulli bandit. We establish a bound on the increase in Bayesian regret when an agent interacts with the Bernoulli bandit, relative to an information-theoretic bound satisfied with the Gaussian bandit. If the Gaussian prior distribution and likelihood function are sufficiently diffuse, this increase grows with the square-root of the time horizon, and thus the per-timestep increase vanishes. Our results formalize the folklore that so-called Bayesian agents remain effective when instantiated with diffuse misspecified distributions.
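    A toy numpy sketch of this kind of misspecified agent: Thompson sampling that maintains a Gaussian posterior with a Gaussian likelihood while the environment actually pays Bernoulli rewards. This illustrates the setup only; the paper's agent is designed around the information ratio, not plain Thompson sampling.

        import numpy as np

        def gaussian_agent_on_bernoulli(means, T=5000, prior_var=10.0,
                                        noise_var=1.0, seed=0):
            rng = np.random.default_rng(seed)
            K = len(means)
            mu, var = np.zeros(K), np.full(K, prior_var)
            total = 0.0
            for _ in range(T):
                a = int(np.argmax(rng.normal(mu, np.sqrt(var))))  # posterior sample
                r = float(rng.random() < means[a])                # Bernoulli reward
                prec = 1.0 / var[a] + 1.0 / noise_var             # conjugate Gaussian
                mu[a] = (mu[a] / var[a] + r / noise_var) / prec   # update, despite the
                var[a] = 1.0 / prec                               # misspecification
                total += r
            return total

        print(gaussian_agent_on_bernoulli([0.3, 0.5, 0.7]))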
    Automated Scoring of Graphical Open-Ended Responses Using Artificial Neural Networks. (arXiv:2201.01783v1 [cs.CV])
    (2 min) Automated scoring of free drawings or images as responses has yet to be utilized in large-scale assessments of student achievement. In this study, we propose artificial neural networks to classify these types of graphical responses from a computer-based international mathematics and science assessment. We compare the classification accuracy of convolutional and feedforward approaches. Our results show that convolutional neural networks (CNNs) outperform feedforward neural networks in both loss and accuracy. The CNN models classified up to 97.71% of the image responses into the appropriate scoring category, which is comparable to, if not more accurate than, typical human raters. These findings were further strengthened by the observation that the most accurate CNN models correctly classified some image responses that had been incorrectly scored by the human raters. As an additional innovation, we outline a method to select human-rated responses for the training sample based on an application of the expected response function derived from item response theory. This paper argues that CNN-based automated scoring of image responses is a highly accurate procedure that could potentially replace the workload and cost of second human raters for large-scale assessments, while improving the validity and comparability of scoring complex constructed-response items.
    Hidden Agenda: a Social Deduction Game with Diverse Learned Equilibria. (arXiv:2201.01816v1 [cs.AI])
    (2 min) A key challenge in the study of multiagent cooperation is the need for individual agents not only to cooperate effectively, but to decide with whom to cooperate. This is particularly critical in situations where other agents have hidden, possibly misaligned motivations and goals. Social deduction games offer an avenue to study how individuals might learn to synthesize potentially unreliable information about others, and elucidate their true motivations. In this work, we present Hidden Agenda, a two-team social deduction game that provides a 2D environment for studying learning agents in scenarios of unknown team alignment. The environment admits a rich set of strategies for both teams. Reinforcement learning agents trained in Hidden Agenda show that agents can learn a variety of behaviors, including partnering and voting, without the need for communication in natural language.
    POCO: Point Convolution for Surface Reconstruction. (arXiv:2201.01831v1 [cs.CV])
    (2 min) Implicit neural networks have been successfully used for surface reconstruction from point clouds. However, many of them face scalability issues as they encode the isosurface function of a whole object or scene into a single latent vector. To overcome this limitation, a few approaches infer latent vectors on a coarse regular 3D grid or on 3D patches, and interpolate them to answer occupancy queries. In doing so, they lose the direct connection with the input points sampled on the surface of objects, and they attach information uniformly in space rather than where it matters the most, i.e., near the surface. Besides, relying on fixed patch sizes may require discretization tuning. To address these issues, we propose to use point cloud convolutions and compute latent vectors at each input point. We then perform a learning-based interpolation on nearest neighbors using inferred weights. Experiments on both object and scene datasets show that our approach significantly outperforms other methods on most classical metrics, producing finer details and better reconstructing thinner volumes. The code is available at https://github.com/valeoai/POCO.
    FCNN: Five-point stencil CNN for solving reaction-diffusion equations. (arXiv:2201.01854v1 [cs.LG])
    (2 min) In this paper, we propose Five-point stencil CNN (FCNN) containing a five-point stencil kernel and a trainable approximation function. We consider reaction-diffusion type equations including the heat, Fisher's, and Allen-Cahn equations, and reaction-diffusion equations with trigonometric functions. Our proposed FCNN trains well on little data and can then predict reaction-diffusion evolutions with unseen initial conditions. It also trains well on noisy training data. We present various simulation results to demonstrate that our proposed FCNN performs well.
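    A hedged PyTorch sketch of the idea: a convolution whose support is restricted to the five-point stencil, paired with a small trainable network for the local reaction term. The layer sizes and the explicit one-step update are illustrative assumptions, not the paper's exact architecture.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class FivePointStencilStep(nn.Module):
            def __init__(self, hidden=16):
                super().__init__()
                self.w = nn.Parameter(torch.randn(5) * 0.01)  # stencil weights
                self.reaction = nn.Sequential(                # trainable approx.
                    nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1))

            def forward(self, u):                             # u: (B, 1, H, W)
                k = torch.zeros(1, 1, 3, 3, device=u.device, dtype=u.dtype)
                k[0, 0, 1, 1] = self.w[0]                     # centre
                k[0, 0, 0, 1], k[0, 0, 2, 1] = self.w[1], self.w[2]  # N, S
                k[0, 0, 1, 0], k[0, 0, 1, 2] = self.w[3], self.w[4]  # W, E
                diffusion = F.conv2d(u, k, padding=1)
                react = self.reaction(u.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
                return u + diffusion + react                  # one time step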
    Necessary and sufficient graphical conditions for optimal adjustment sets in causal graphical models with hidden variables. (arXiv:2102.10324v3 [cs.LG] UPDATED)
    (2 min) The problem of selecting optimal backdoor adjustment sets to estimate causal effects in graphical models with hidden and conditioned variables is addressed. Previous work has defined optimality as achieving the smallest asymptotic estimation variance and derived an optimal set for the case without hidden variables. For the case with hidden variables there can be settings where no optimal set exists and currently only a sufficient graphical optimality criterion of limited applicability has been derived. In the present work optimality is characterized as maximizing a certain adjustment information, which allows us to derive a necessary and sufficient graphical criterion for the existence of an optimal adjustment set, together with a definition and algorithm to construct it. Further, the optimal set is valid if and only if a valid adjustment set exists and has higher (or equal) adjustment information than the Adjust-set proposed in Perkovi{\'c} et al. [Journal of Machine Learning Research, 18: 1--62, 2018] for any graph. The results translate to minimal asymptotic estimation variance for a class of estimators whose asymptotic variance follows a certain information-theoretic relation. Numerical experiments indicate that the asymptotic results also hold for relatively small sample sizes and that the optimal adjustment set or minimized variants thereof often yield better variance also beyond that estimator class. Surprisingly, among the randomly created setups more than 90\% fulfill the optimality conditions, indicating that graphical optimality may also hold in many real-world scenarios. Code is available as part of the python package \url{https://github.com/jakobrunge/tigramite}.
    PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers. (arXiv:2111.12710v2 [cs.CV] UPDATED)
    (2 min) This paper explores a better codebook for BERT pre-training of vision transformers. The recent work BEiT successfully transfers BERT pre-training from NLP to the vision field. It directly adopts one simple discrete VAE as the visual tokenizer, but has not considered the semantic level of the resulting visual tokens. By contrast, the discrete tokens in the NLP field are naturally highly semantic. This difference motivates us to learn a perceptual codebook, and we surprisingly find one simple yet effective idea: enforcing perceptual similarity during the dVAE training. We demonstrate that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings, and subsequently help pre-training achieve superior transfer performance in various downstream tasks. For example, we achieve 84.5% Top-1 accuracy on ImageNet-1K with a ViT-B backbone, outperforming the competitive method BEiT by +1.3 with the same pre-training epochs. It also improves the performance of object detection and segmentation tasks on COCO val by +1.3 box AP and +1.0 mask AP, and semantic segmentation on ADE20k by +1.0 mIoU. Equipped with a larger backbone ViT-H, we achieve state-of-the-art performance (88.3% Top-1 accuracy) among methods using only ImageNet-1K data. The code and models will be available at https://github.com/microsoft/PeCo.
    Machine Learning: Algorithms, Models, and Applications. (arXiv:2201.01943v1 [cs.LG])
    (2 min) Recent times are witnessing rapid development in machine learning algorithm systems, especially in reinforcement learning, natural language processing, computer and robot vision, image processing, speech, and emotional processing and understanding. In tune with the increasing importance and relevance of machine learning models, algorithms, and their applications, and with the emergence of more innovative use cases of deep learning and artificial intelligence, the current volume presents a few innovative research works and their applications in the real world, such as stock trading, medical and healthcare systems, and software automation. The chapters in the book illustrate how machine learning and deep learning algorithms and models are designed, optimized, and deployed. The volume will be useful for advanced graduate and doctoral students, researchers, faculty members of universities, practicing data scientists and data engineers, professionals, and consultants working on the broad areas of machine learning, deep learning, and artificial intelligence.
    A Light in the Dark: Deep Learning Practices for Industrial Computer Vision. (arXiv:2201.02028v1 [cs.CV])
    (2 min) In recent years, large pre-trained deep neural networks (DNNs) have revolutionized the field of computer vision (CV). Although these DNNs have been shown to be very well suited for general image recognition tasks, application in industry is often precluded for three reasons: 1) large pre-trained DNNs are built on hundreds of millions of parameters, making deployment on many devices impossible, 2) the underlying dataset for pre-training consists of general objects, while industrial cases often consist of very specific objects, such as structures on solar wafers, 3) potentially biased pre-trained DNNs raise legal issues for companies. As a remedy, we study neural networks for CV that we train from scratch. For this purpose, we use a real-world case from a solar wafer manufacturer. We find that our neural networks achieve similar performances as pre-trained DNNs, even though they consist of far fewer parameters and do not rely on third-party datasets.
    A Generalized Bootstrap Target for Value-Learning, Efficiently Combining Value and Feature Predictions. (arXiv:2201.01836v1 [cs.LG])
    (2 min) Estimating value functions is a core component of reinforcement learning algorithms. Temporal difference (TD) learning algorithms use bootstrapping, i.e. they update the value function toward a learning target using value estimates at subsequent time-steps. Alternatively, the value function can be updated toward a learning target constructed by separately predicting successor features (SF)--a policy-dependent model--and linearly combining them with instantaneous rewards. We focus on bootstrapping targets used when estimating value functions, and propose a new backup target, the $\eta$-return mixture, which implicitly combines value-predictive knowledge (used by TD methods) with (successor) feature-predictive knowledge--with a parameter $\eta$ capturing how much to rely on each. We illustrate that incorporating predictive knowledge through an $\eta\gamma$-discounted SF model makes more efficient use of sampled experience, compared to either extreme, i.e. bootstrapping entirely on the value function estimate, or bootstrapping on the product of separately estimated successor features and instantaneous reward models. We empirically show this approach leads to faster policy evaluation and better control performance, for tabular and nonlinear function approximations, indicating scalability and generality.
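    In symbols, and heavily hedged: the sketch below is our simplified reading of the mixture, not the paper's exact target or notation. The idea is a convex blend of a value bootstrap and a successor-feature bootstrap.

        import numpy as np

        def eta_mixture_target(r, gamma, eta, v_next, sf_next, w):
            # td_target bootstraps on the value estimate at the next state;
            # sf_target bootstraps on predicted successor features combined
            # linearly with a learned reward model w.
            td_target = r + gamma * v_next
            sf_target = r + gamma * sf_next @ w
            return eta * td_target + (1.0 - eta) * sf_target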
    Round and Communication Balanced Protocols for Oblivious Evaluation of Finite State Machines. (arXiv:2103.11240v2 [cs.CR] UPDATED)
    (2 min) We propose protocols for obliviously evaluating finite-state machines, i.e., the evaluation is shared between the provider of the finite-state machine and the provider of the input string in such a manner that neither party learns the other's input, and the states being visited are hidden from both. For alphabet size $|\Sigma|$, number of states $|Q|$, and input length $n$, previous solutions have either required a number of rounds linear in $n$ or communication $\Omega(n|\Sigma||Q|\log|Q|)$. Our solutions require 2 rounds with communication $O(n(|\Sigma|+|Q|\log|Q|))$. We present two different solutions to this problem, a two-party one and a setting with an untrusted but non-colluding helper.
    Frame Shift Prediction. (arXiv:2201.01837v1 [cs.CL])
    (2 min) Frame shift is a cross-linguistic phenomenon in translation which results in corresponding pairs of linguistic material evoking different frames. The ability to predict frame shifts enables automatic creation of multilingual FrameNets through annotation projection. Here, we propose the Frame Shift Prediction task and demonstrate that graph attention networks, combined with auxiliary training, can learn cross-linguistic frame-to-frame correspondence and predict frame shifts.
    Online Influence Maximization with Node-level Feedback Using Standard Offline Oracles. (arXiv:2109.06077v2 [cs.SI] UPDATED)
    (3 min) We study the online influence maximization (OIM) problem in social networks, where in multiple rounds the learner repeatedly chooses seed nodes to generate cascades, observes the cascade feedback, and gradually learns the best seeds that generate the largest cascade. We focus on two major challenges in this paper. First, we work with node-level feedback instead of edge-level feedback. Edge-level feedback reveals all edges that pass through information in a cascade, whereas node-level feedback only reveals the activated nodes with timestamps. Node-level feedback is arguably more realistic since in practice it is relatively easy to observe who is influenced but very difficult to observe from which relationship (edge) the influence comes. Second, we use a standard offline oracle instead of an offline pair-oracle. To compute a good seed set for the next round, an offline pair-oracle finds the best seed set and the best parameters within the confidence region simultaneously, and such an oracle is difficult to compute due to the combinatorial core of the OIM problem. So we focus on how to use the standard offline influence maximization oracle, which finds the best seed set given the edge parameters as input. In this paper, we resolve these challenges for the two most popular diffusion models, the independent cascade (IC) and the linear threshold (LT) model. For the IC model, past research only achieves edge-level feedback, while we present the first $\widetilde{O}(\sqrt{T})$-regret algorithm for node-level feedback. Besides, the algorithm only invokes standard offline oracles. For the LT model, a recent study only provides an OIM solution that meets the first challenge but still requires a pair-oracle. In this paper, we apply a similar technique as in the IC model to replace the pair-oracle with a standard oracle while maintaining $\widetilde{O}(\sqrt{T})$-regret.
    Skip Vectors for RDF Data: Extraction Based on the Complexity of Feature Patterns. (arXiv:2201.01996v1 [cs.LG])
    (2 min) The Resource Description Framework (RDF) is a framework for describing metadata, such as attributes and relationships of resources on the Web. Machine learning tasks for RDF graphs adopt three methods: (i) support vector machines (SVMs) with RDF graph kernels, (ii) RDF graph embeddings, and (iii) relational graph convolutional networks. In this paper, we propose a novel feature vector (called a Skip vector) that represents some features of each resource in an RDF graph by extracting various combinations of neighboring edges and nodes. In order to make the Skip vector low-dimensional, we select important features for classification tasks based on the information gain ratio of each feature. The classification tasks can be performed by applying the low-dimensional Skip vector of each resource to conventional machine learning algorithms, such as SVMs, the k-nearest neighbors method, neural networks, random forests, and AdaBoost. In our evaluation experiments with RDF data, such as Wikidata, DBpedia, and YAGO, we compare our method with RDF graph kernels in an SVM. We also compare our method with two other approaches, RDF graph embeddings such as RDF2vec and relational graph convolutional networks, on the AIFB, MUTAG, BGS, and AM benchmarks.
    Forming Predictive Features of Tweets for Decision-Making Support. (arXiv:2201.02049v1 [cs.CL])
    (2 min) The article describes approaches for forming different predictive features of tweet data sets and using them in predictive analysis for decision-making support. Graph theory, as well as frequent itemset and association rule theory, is used for forming and retrieving different features from these datasets. The use of these approaches makes it possible to reveal a semantic structure in tweets related to a specified entity. It is shown that quantitative characteristics of semantic frequent itemsets can be used in predictive regression models with specified target variables.
    Sentiment Analysis and Sarcasm Detection of Indian General Election Tweets. (arXiv:2201.02127v1 [cs.IR])
    (2 min) Social Media usage has increased to an all-time high level in today's digital world. The majority of the population uses social media tools (like Twitter, Facebook, YouTube, etc.) to share their thoughts and experiences with the community. Analysing the sentiments and opinions of the common public is very important for both the government and businesses. This is the reason behind the activeness of many media agencies during election time for performing various kinds of opinion polls. In this paper, we have worked towards analysing the sentiments of the people of India during the Lok Sabha election of 2019 using the Twitter data of that duration. We have built an automatic tweet analyser using the Transfer Learning technique to handle the unsupervised nature of this problem. We have used the Linear Support Vector Classifier method in our machine learning model, along with the Term Frequency-Inverse Document Frequency (TF-IDF) methodology for handling the textual data of tweets. Further, we have increased the capability of the model to address the sarcastic tweets posted by some of the users, which has not yet been considered by researchers in this domain.
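    The core of such a pipeline is standard in scikit-learn. The snippet below is a minimal sketch with placeholder tweets, not the election dataset or the authors' exact configuration.

        from sklearn.pipeline import Pipeline
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.svm import LinearSVC

        tweets = ["great rally today", "worst policies ever", "love this manifesto"]
        labels = ["positive", "negative", "positive"]

        clf = Pipeline([
            ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # TF-IDF features
            ("svm", LinearSVC()),                            # linear SVM classifier
        ])
        clf.fit(tweets, labels)
        print(clf.predict(["such a great rally"]))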
    A deep learning-based model reduction (DeePMR) method for simplifying chemical kinetics. (arXiv:2201.02025v1 [cs.LG])
    (2 min) A deep learning-based model reduction (DeePMR) method for simplifying chemical kinetics is proposed and validated using high-temperature auto-ignitions, perfectly stirred reactors (PSR), and one-dimensional freely propagating flames of n-heptane/air mixtures. The mechanism reduction is modeled as an optimization problem on Boolean space, where a Boolean vector, each entry corresponding to a species, represents a reduced mechanism. The optimization goal is to minimize the reduced mechanism size given the error tolerance of a group of pre-selected benchmark quantities. The key idea of the DeePMR is to employ a deep neural network (DNN) to formulate the objective function in the optimization problem. In order to explore high-dimensional Boolean space efficiently, an iterative DNN-assisted data sampling and DNN training procedure is implemented. The results show that DNN-assistance improves sampling efficiency significantly, selecting only $10^5$ samples out of $10^{34}$ possible samples for the DNN to achieve sufficient accuracy. The results demonstrate the capability of the DNN to recognize key species and reasonably predict reduced mechanism performance. The well-trained DNN guarantees the optimal reduced mechanism by solving an inverse optimization problem. By comparing ignition delay times, laminar flame speeds, and temperatures in PSRs, the resulting skeletal mechanism has fewer species (45 species) but the same level of accuracy as the skeletal mechanism (56 species) obtained by the Path Flux Analysis (PFA) method. In addition, the skeletal mechanism can be further reduced to 28 species if only considering atmospheric, near-stoichiometric conditions (equivalence ratio between 0.6 and 1.2). The DeePMR provides an innovative way to perform model reduction and demonstrates the great potential of data-driven methods in the combustion area.
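    The surrogate-assisted loop can be caricatured in a few lines. Below, a toy scoring function stands in for the expensive kinetics simulations, and a scikit-learn MLP plays the role of the DNN; everything here is an illustrative assumption, not the DeePMR implementation.

        import numpy as np
        from sklearn.neural_network import MLPRegressor

        rng = np.random.default_rng(0)
        n_species = 30
        importance = np.linspace(1.0, 0.0, n_species)

        def simulate_error(mask):           # stand-in for the kinetics solver
            return float(((1 - mask) * importance).sum())

        # Evaluate random Boolean mechanisms, fit the surrogate, then screen
        # many smaller candidates cheaply with the surrogate's predictions.
        X = (rng.random((200, n_species)) < 0.7).astype(float)
        y = np.array([simulate_error(m) for m in X])
        surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000).fit(X, y)

        cands = (rng.random((5000, n_species)) < 0.5).astype(float)
        score = surrogate.predict(cands) + 0.1 * cands.sum(axis=1)  # error vs. size
        best = cands[np.argmin(score)]
        print(int(best.sum()), simulate_error(best))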
    Machine-Learning the Classification of Spacetimes. (arXiv:2201.01644v1 [gr-qc] CROSS LISTED)
    (2 min) On the long-established classification problems in general relativity we take a novel perspective by adopting fruitful techniques from machine learning and modern data science. In particular, we model Petrov's classification of spacetimes, and show that a feed-forward neural network can achieve a high degree of success. We also show how data visualization techniques with dimensionality reduction can help analyze the underlying patterns in the structure of the different types of spacetimes.
    Bayesian Regression Approach for Building and Stacking Predictive Models in Time Series Analytics. (arXiv:2201.02034v1 [stat.AP])
    (2 min) The paper describes the use of Bayesian regression for building time series models and stacking different predictive models for time series. The use of Bayesian regression for time series modeling with a nonlinear trend is analyzed. This approach makes it possible to estimate the uncertainty of a time series prediction and calculate value-at-risk characteristics. A hierarchical model for time series using Bayesian regression has been considered. In this approach, one set of parameters is the same for all data samples, while other parameters can differ across groups of data samples. Such an approach allows using this model in the case of short historical data for a specified time series, e.g. in the case of new stores or new products in the sales prediction problem. In the study of predictive model stacking, the models ARIMA, Neural Network, Random Forest, and Extra Trees were used for prediction on the first level of the model ensemble. On the second level, the time series predictions of these models on the validation set were stacked by Bayesian regression. This approach yields distributions for the regression coefficients of these models and makes it possible to estimate the uncertainty contributed by each model to the stacking result. The information about these distributions allows us to select an optimal set of stacking models, taking into account the domain knowledge. The probabilistic approach to stacking predictive models allows us to make risk assessments for the predictions, which is important in a decision-making process.
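    The second-level stacking step reduces to a Bayesian linear regression of the validation-set observations on the first-level predictions. A minimal sketch with synthetic stand-ins for the first-level forecasts; scikit-learn's BayesianRidge is our illustrative choice of Bayesian regressor, not necessarily the paper's.

        import numpy as np
        from sklearn.linear_model import BayesianRidge

        rng = np.random.default_rng(0)
        y = np.sin(np.linspace(0, 10, 200)) + 0.1 * rng.standard_normal(200)

        # Columns: stand-ins for ARIMA, neural network, random forest, and
        # extra trees predictions on the validation window.
        P = np.column_stack([y + 0.2 * rng.standard_normal(200) for _ in range(4)])

        stacker = BayesianRidge().fit(P, y)
        mean, std = stacker.predict(P, return_std=True)
        # Coefficient posteriors indicate each model's contribution and its
        # uncertainty; std gives the predictive spread used for risk measures.
        print(stacker.coef_, std.mean())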
    Deep Causal Reasoning for Recommendations. (arXiv:2201.02088v1 [cs.LG])
    (2 min) Traditional recommender systems aim to estimate a user's rating of an item based on observed ratings from the population. As with all observational studies, hidden confounders, which are factors that affect both item exposures and user ratings, lead to a systematic bias in the estimation. Consequently, a new trend in recommender system research is to negate the influence of confounders from a causal perspective. Observing that confounders in recommendations are usually shared among items and are therefore multi-cause confounders, we model the recommendation as a multi-cause multi-outcome (MCMO) inference problem. Specifically, to remedy confounding bias, we estimate user-specific latent variables that render the item exposures independent Bernoulli trials. The generative distribution is parameterized by a DNN with factorized logistic likelihood and the intractable posteriors are estimated by variational inference. Controlling these factors as substitute confounders, under mild assumptions, can eliminate the bias incurred by multi-cause confounders. Furthermore, we show that MCMO modeling may lead to high variance due to scarce observations associated with the high-dimensional causal space. Fortunately, we theoretically demonstrate that introducing user features as pre-treatment variables can substantially improve sample efficiency and alleviate overfitting. Empirical studies on simulated and real-world datasets show that the proposed deep causal recommender shows more robustness to unobserved confounders than state-of-the-art causal recommenders. Codes and datasets are released at https://github.com/yaochenzhu/deep-deconf.
    An exploratory experiment on Hindi, Bengali hate-speech detection and transfer learning using neural networks. (arXiv:2201.01997v1 [cs.CL])
    (2 min) This work presents our approach to training a neural network to detect hate-speech texts in Hindi and Bengali. We also explore how transfer learning can be applied to learning these languages, given that they have the same origin and are thus similar to some extent. Even though the whole experiment was conducted with low computational power, the obtained result is comparable to the results of other, more expensive, models. Furthermore, since the training data in use is relatively small and the two languages are almost entirely unknown to us, this work can be generalized as an effort to demystify lost or alien languages that no human is capable of understanding.
    Knowledge Informed Machine Learning using a Weibull-based Loss Function. (arXiv:2201.01769v1 [cs.LG])
    (2 min) Machine learning can be enhanced through the integration of external knowledge. This method, called knowledge informed machine learning, is also applicable within the field of Prognostics and Health Management (PHM). In this paper, the various methods of knowledge informed machine learning, from a PHM context, are reviewed with the goal of helping the reader understand the domain. In addition, a knowledge informed machine learning technique is demonstrated, using the common IMS and PRONOSTIA bearing data sets, for remaining useful life (RUL) prediction. Specifically, knowledge is garnered from the field of reliability engineering and represented through the Weibull distribution. The knowledge is then integrated into a neural network through a novel Weibull-based loss function. A thorough statistical analysis of the Weibull-based loss function is conducted, demonstrating the effectiveness of the method on the PRONOSTIA data set. However, the Weibull-based loss function is less effective on the IMS data set. The results, shortcomings, and benefits of the approach are discussed at length. Finally, all the code is publicly available for the benefit of other researchers.
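    A hedged sketch of what a Weibull-based loss can look like in PyTorch: the network output is interpreted as the Weibull scale parameter and the loss is the negative log-likelihood of the observed RUL. The fixed shape parameter and this exact parameterization are our assumptions; the paper's loss may differ.

        import math
        import torch

        def weibull_nll(pred_scale, rul, shape=2.0):
            lam = torch.clamp(pred_scale, min=1e-6)   # predicted scale
            t = torch.clamp(rul, min=1e-6)            # observed RUL
            k = shape
            # log f(t) for Weibull(shape k, scale lam)
            log_pdf = (math.log(k) - k * torch.log(lam)
                       + (k - 1) * torch.log(t) - (t / lam) ** k)
            return -log_pdf.mean()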
    Sales Time Series Analytics Using Deep Q-Learning. (arXiv:2201.02058v1 [cs.LG])
    (2 min) The article describes the use of deep Q-learning models in problems of sales time series analytics. In contrast to supervised machine learning, which is a kind of passive learning using historical data, Q-learning is a kind of active learning whose goal is to maximize a reward via an optimal sequence of actions. A model-free Q-learning approach for optimal pricing strategies and supply-demand problems is considered in the work. The main idea of the study is to show that using a deep Q-learning approach in time series analytics, the sequence of actions can be optimized by maximizing the reward function, both when the environment for learning agent interaction is modeled using a parametric model and when using a model based on historical data. In the pricing optimization case study, the environment was modeled using sales dependence on extra price and randomly simulated demand. In the supply-demand case study, it was proposed to use historical demand time series for environment modeling; agent states were represented by promo actions, previous demand values and weekly seasonality features. The obtained results show that using deep Q-learning, we can optimize the decision-making process for price optimization and supply-demand problems. Environment modeling using parametric models and historical data can be used for the cold start of a learning agent. After the cold start, the trained agent can be used in a real business environment.
    Robust Linear Predictions: Analyses of Uniform Concentration, Fast Rates and Model Misspecification. (arXiv:2201.01973v1 [stat.ML])
    (2 min) The problem of linear predictions has been extensively studied for the past century under fairly general frameworks. Recent advances in the robust statistics literature allow us to analyze robust versions of classical linear models through the prism of Median of Means (MoM). Combining these approaches in a piecemeal way might lead to ad-hoc procedures, and the restricted theoretical conclusions that underpin each individual contribution may no longer be valid. To meet these challenges coherently, in this study, we offer a unified robust framework that includes a broad variety of linear prediction problems on a Hilbert space, coupled with a generic class of loss functions. Notably, we do not require any assumptions on the distribution of the outlying data points ($\mathcal{O}$) nor the compactness of the support of the inlying ones ($\mathcal{I}$). Under mild conditions on the dual norm, we show that for misspecification level $\epsilon$, these estimators achieve an error rate of $O(\max\left\{|\mathcal{O}|^{1/2}n^{-1/2}, |\mathcal{I}|^{1/2}n^{-1} \right\}+\epsilon)$, matching the best-known rates in the literature. This rate is slightly slower than the classical rates of $O(n^{-1/2})$, indicating that we need to pay a price in terms of error rates to obtain robust estimates. Additionally, we show that this rate can be improved to achieve so-called ``fast rates" under additional assumptions.
    Deep Learning Assisted End-to-End Synthesis of mm-Wave Passive Networks with 3D EM Structures: A Study on A Transformer-Based Matching Network. (arXiv:2201.02141v1 [cs.LG])
    (2 min) This paper presents a deep learning assisted synthesis approach for direct end-to-end generation of RF/mm-wave passive matching networks with 3D EM structures. Different from prior approaches that synthesize EM structures from target circuit component values and target topologies, our proposed approach achieves direct synthesis of the passive network given the network topology, from desired performance values as input. We showcase the proposed synthesis Neural Network (NN) model on an on-chip 1:1 transformer-based impedance matching network. By leveraging parameter sharing, the synthesis NN model successfully extracts relevant features from the input impedance and load capacitors, and predicts the transformer 3D EM geometry in a 45nm SOI process that will match the standard 50$\Omega$ load to the target input impedance while absorbing the two loading capacitors. As a proof of concept, several example transformer geometries were synthesized and verified in Ansys HFSS to provide the desired input impedance.
    Machine Learning for Variance Reduction in Online Experiments. (arXiv:2106.07263v3 [stat.ML] UPDATED)
    (2 min) We consider the problem of variance reduction in randomized controlled trials, through the use of covariates correlated with the outcome but independent of the treatment. We propose a machine learning regression-adjusted treatment effect estimator, which we call MLRATE. MLRATE uses machine learning predictors of the outcome to reduce estimator variance. It employs cross-fitting to avoid overfitting biases, and we prove consistency and asymptotic normality under general conditions. MLRATE is robust to poor predictions from the machine learning step: if the predictions are uncorrelated with the outcomes, the estimator performs asymptotically no worse than the standard difference-in-means estimator, while if predictions are highly correlated with outcomes, the efficiency gains are large. In A/A tests, for a set of 48 outcome metrics commonly monitored in Facebook experiments, the estimator has over 70% lower variance than the simple difference-in-means estimator, and about 19% lower variance than the common univariate procedure, which adjusts only for pre-experiment values of the outcome.
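    A simplified cross-fitted sketch of the idea (a difference of ML-adjusted means, not the paper's exact MLRATE estimator, which additionally regresses on the predictions and their interaction with treatment):

        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor
        from sklearn.model_selection import KFold

        def adjusted_diff_in_means(X, y, treat, n_splits=2):
            g = np.empty_like(y, dtype=float)
            kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
            for tr, te in kf.split(X):
                # Cross-fitting: predict each fold with a model trained on
                # the other folds, to avoid overfitting bias.
                g[te] = GradientBoostingRegressor().fit(X[tr], y[tr]).predict(X[te])
            resid = y - g                             # variance-reduced outcome
            return resid[treat == 1].mean() - resid[treat == 0].mean()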
    S2TA: Exploiting Structured Sparsity for Energy-Efficient Mobile CNN Acceleration. (arXiv:2107.07983v2 [cs.AR] UPDATED)
    (2 min) Exploiting sparsity is a key technique in accelerating quantized convolutional neural network (CNN) inference on mobile devices. Prior sparse CNN accelerators largely exploit unstructured sparsity and achieve significant speedups. Due to the unbounded, largely unpredictable sparsity patterns, however, exploiting unstructured sparsity requires complicated hardware design with significant energy and area overhead, which is particularly detrimental to mobile/IoT inference scenarios where energy and area efficiency are crucial. We propose to exploit structured sparsity, more specifically, Density Bound Block (DBB) sparsity for both weights and activations. DBB block tensors bound the maximum number of non-zeros per block. DBB thus exposes statically predictable sparsity patterns that enable lean sparsity-exploiting hardware. We propose new hardware primitives to implement DBB sparsity for (static) weights and (dynamic) activations, respectively, with very low overheads. Building on top of the primitives, we describe S2TA, a systolic array-based CNN accelerator that exploits joint weight and activation DBB sparsity and new dimensions of data reuse unavailable on the traditional systolic array. S2TA in 16nm achieves more than 2x speedup and energy reduction compared to a strong baseline of a systolic array with zero-value clock gating, over five popular CNN benchmarks. Compared to two recent non-systolic sparse accelerators, Eyeriss v2 (65nm) and SparTen (45nm), S2TA in 65nm uses about 2.2x and 3.1x less energy per inference, respectively.
    Federated Optimization of Smooth Loss Functions. (arXiv:2201.01954v1 [cs.LG])
    (2 min) In this work, we study empirical risk minimization (ERM) within a federated learning framework, where a central server minimizes an ERM objective function using training data that is stored across $m$ clients. In this setting, the Federated Averaging (FedAve) algorithm is the staple for determining $\epsilon$-approximate solutions to the ERM problem. Similar to standard optimization algorithms, the convergence analysis of FedAve only relies on smoothness of the loss function in the optimization parameter. However, loss functions are often very smooth in the training data too. To exploit this additional smoothness, we propose the Federated Low Rank Gradient Descent (FedLRGD) algorithm. Since smoothness in data induces an approximate low rank structure on the loss function, our method first performs a few rounds of communication between the server and clients to learn weights that the server can use to approximate clients' gradients. Then, our method solves the ERM problem at the server using inexact gradient descent. To show that FedLRGD can have superior performance to FedAve, we present a notion of federated oracle complexity as a counterpart to canonical oracle complexity. Under some assumptions on the loss function, e.g., strong convexity in parameter, $\eta$-H\"older smoothness in data, etc., we prove that the federated oracle complexity of FedLRGD scales like $\phi m(p/\epsilon)^{\Theta(d/\eta)}$ and that of FedAve scales like $\phi m(p/\epsilon)^{3/4}$ (neglecting sub-dominant factors), where $\phi\gg 1$ is a "communication-to-computation ratio," $p$ is the parameter dimension, and $d$ is the data dimension. Then, we show that when $d$ is small and the loss function is sufficiently smooth in the data, FedLRGD beats FedAve in federated oracle complexity. Finally, in the course of analyzing FedLRGD, we also establish a result on low rank approximation of latent variable models.
    Learning in Markov Decision Processes under Constraints. (arXiv:2002.12435v5 [cs.LG] UPDATED)
    (2 min) We consider reinforcement learning (RL) in Markov Decision Processes in which an agent repeatedly interacts with an environment that is modeled by a controlled Markov process. At each time step $t$, it earns a reward, and also incurs a cost-vector consisting of $M$ costs. We design model-based RL algorithms that maximize the cumulative reward earned over a time horizon of $T$ time-steps, while simultaneously ensuring that the average values of the $M$ cost expenditures are bounded by agent-specified thresholds $c^{ub}_i,i=1,2,\ldots,M$. In order to measure the performance of a reinforcement learning algorithm that satisfies the average cost constraints, we define an $M+1$ dimensional regret vector that is composed of its reward regret, and $M$ cost regrets. The reward regret measures the sub-optimality in the cumulative reward, while the $i$-th component of the cost regret vector is the difference between its $i$-th cumulative cost expense and the expected cost expenditures $Tc^{ub}_i$. We prove that the expected value of the regret vector of UCRL-CMDP, is upper-bounded as $\tilde{O}\left(T^{2\slash 3}\right)$, where $T$ is the time horizon. We further show how to reduce the regret of a desired subset of the $M$ costs, at the expense of increasing the regrets of rewards and the remaining costs. To the best of our knowledge, ours is the only work that considers non-episodic RL under average cost constraints, and derives algorithms that can~\emph{tune the regret vector} according to the agent's requirements on its cost regrets.
    Contrastive Neighborhood Alignment. (arXiv:2201.01922v1 [cs.LG])
    (2 min) We present Contrastive Neighborhood Alignment (CNA), a manifold learning approach to maintain the topology of learned features whereby data points that are mapped to nearby representations by the source (teacher) model are also mapped to neighbors by the target (student) model. The target model aims to mimic the local structure of the source representation space using a contrastive loss. CNA is an unsupervised learning algorithm that does not require ground-truth labels for the individual samples. CNA is illustrated in three scenarios: manifold learning, where the model maintains the local topology of the original data in a dimension-reduced space; model distillation, where a small student model is trained to mimic a larger teacher; and legacy model update, where an older model is replaced by a more powerful one. Experiments show that CNA is able to capture the manifold in a high-dimensional space and improves performance compared to the competing methods in their domains.
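    A hedged PyTorch sketch of such a loss, with simplifications of our own (batch-level neighbours, a plain InfoNCE form): each sample's k nearest neighbours in the teacher space act as positives for the student, and the rest of the batch as negatives.

        import torch
        import torch.nn.functional as F

        def cna_loss(student_z, teacher_z, k=5, temperature=0.1):
            with torch.no_grad():
                d = torch.cdist(teacher_z, teacher_z)      # teacher distances
                d.fill_diagonal_(float("inf"))
                nbr = d.topk(k, largest=False).indices     # teacher neighbours
            z = F.normalize(student_z, dim=1)
            logits = (z @ z.T) / temperature               # student similarities
            eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
            log_p = logits.masked_fill(eye, float("-inf")).log_softmax(dim=1)
            return -log_p.gather(1, nbr).mean()            # pull neighbours close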
    GLAN: A Graph-based Linear Assignment Network. (arXiv:2201.02057v1 [cs.LG])
    (2 min) Differentiable solvers for the linear assignment problem (LAP) have attracted much research attention in recent years, and are usually embedded into learning frameworks as components. However, previous algorithms, with or without learning strategies, usually suffer from degraded optimality as the problem size increases. In this paper, we propose a learnable linear assignment solver based on deep graph networks. Specifically, we first transform the cost matrix to a bipartite graph and convert the assignment task to the problem of selecting reliable edges from the constructed graph. Subsequently, a deep graph network is developed to aggregate and update the features of nodes and edges. Finally, the network predicts a label for each edge that indicates the assignment relationship. The experimental results on a synthetic dataset reveal that our method outperforms state-of-the-art baselines and achieves consistently high accuracy as the problem size increases. Furthermore, we also embed the proposed solver, in comparison with state-of-the-art baseline solvers, into a popular multi-object tracking (MOT) framework to train the tracker in an end-to-end manner. The experimental results on MOT benchmarks illustrate that the proposed LAP solver improves the tracker by the largest margin.
    BinarizedAttack: Structural Poisoning Attacks to Graph-based Anomaly Detection. (arXiv:2106.09989v5 [cs.LG] UPDATED)
    (2 min) Graph-based Anomaly Detection (GAD) is becoming prevalent due to the powerful representation abilities of graphs as well as recent advances in graph mining techniques. These GAD tools, however, expose a new attacking surface, ironically due to their unique advantage of being able to exploit the relations among data. That is, attackers now can manipulate those relations (i.e., the structure of the graph) to allow some target nodes to evade detection. In this paper, we exploit this vulnerability by designing a new type of targeted structural poisoning attack against a representative regression-based GAD system termed OddBall. Specifically, we formulate the attack against OddBall as a bi-level optimization problem, where the key technical challenge is to efficiently solve the problem in a discrete domain. We propose a novel attack method termed BinarizedAttack based on gradient descent. Compared to prior art, BinarizedAttack makes better use of the gradient information, making it particularly suitable for solving combinatorial optimization problems. Furthermore, we investigate the attack transferability of BinarizedAttack by employing it to attack other representation-learning-based GAD systems. Our comprehensive experiments demonstrate that BinarizedAttack is very effective in enabling target nodes to evade graph-based anomaly detection tools with a limited attacker budget, and in the black-box transfer attack setting, BinarizedAttack also proves effective and, in particular, can significantly change the node embeddings learned by the GAD systems. Our research thus opens the door to studying a new type of attack against security analytic tools that rely on graph data.
    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. (arXiv:2201.02177v1 [cs.LG])
    (2 min) In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of "grokking" a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset.
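    For concreteness, the kind of dataset meant here can be generated in a few lines; addition modulo a prime is in the spirit of the binary-operation tables studied, and the train/validation fraction is the knob behind the dataset-size observation above.

        import numpy as np

        p = 97                                            # modulus
        pairs = np.array([(a, b) for a in range(p) for b in range(p)])
        labels = (pairs[:, 0] + pairs[:, 1]) % p          # a + b (mod p)

        rng = np.random.default_rng(0)
        idx = rng.permutation(len(pairs))
        n_train = int(0.5 * len(pairs))                   # training fraction
        train_idx, val_idx = idx[:n_train], idx[n_train:]
        print(len(train_idx), len(val_idx))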
    Jointly Learning Environments and Control Policies with Projected Stochastic Gradient Ascent. (arXiv:2006.01738v4 [cs.LG] UPDATED)
    (2 min) We consider the joint design and control of discrete-time stochastic dynamical systems over a finite time horizon. We formulate the problem as a multi-step optimization problem under uncertainty seeking to identify a system design and a control policy that jointly maximize the expected sum of rewards collected over the time horizon considered. The transition function, the reward function and the policy are all parametrized, assumed known and differentiable with respect to their parameters. We then introduce a deep reinforcement learning algorithm combining policy gradient methods with model-based optimization techniques to solve this problem. In essence, our algorithm iteratively approximates the gradient of the expected return via Monte-Carlo sampling and automatic differentiation and takes projected gradient ascent steps in the space of environment and policy parameters. This algorithm is referred to as Direct Environment and Policy Search (DEPS). We assess the performance of our algorithm in three environments concerned with the design and control of a mass-spring-damper system, a small-scale off-grid power system and a drone, respectively. In addition, our algorithm is benchmarked against a state-of-the-art deep reinforcement learning algorithm used to tackle joint design and control problems. We show that DEPS performs at least as well or better in all three environments, consistently yielding solutions with higher returns in fewer iterations. Finally, solutions produced by our algorithm are also compared with solutions produced by an algorithm that does not jointly optimize environment and policy parameters, highlighting the fact that higher returns can be achieved when joint optimization is performed.
    Representation Learning for Online and Offline RL in Low-rank MDPs. (arXiv:2110.04652v3 [cs.LG] UPDATED)
    (2 min) This work studies the question of Representation Learning in RL: how can we learn a compact low-dimensional representation such that on top of the representation we can perform RL procedures such as exploration and exploitation, in a sample efficient manner. We focus on the low-rank Markov Decision Processes (MDPs) where the transition dynamics correspond to a low-rank transition matrix. Unlike prior works that assume the representation is known (e.g., linear MDPs), here we need to learn the representation for the low-rank MDP. We study both the online RL and offline RL settings. For the online setting, operating with the same computational oracles used in FLAMBE (Agarwal et al.), the state-of-the-art algorithm for learning representations in low-rank MDPs, we propose an algorithm REP-UCB (Upper Confidence Bound driven Representation learning for RL), which significantly improves the sample complexity from $\widetilde{O}( A^9 d^7 / (\epsilon^{10} (1-\gamma)^{22}))$ for FLAMBE to $\widetilde{O}( A^2 d^4 / (\epsilon^2 (1-\gamma)^{5}) )$ with $d$ being the rank of the transition matrix (or dimension of the ground truth representation), $A$ being the number of actions, and $\gamma$ being the discount factor. Notably, REP-UCB is simpler than FLAMBE, as it directly balances the interplay between representation learning, exploration, and exploitation, while FLAMBE is an explore-then-commit style approach and has to perform reward-free exploration step-by-step forward in time. For the offline RL setting, we develop an algorithm that leverages pessimism to learn under a partial coverage condition: our algorithm is able to compete against any policy as long as it is covered by the offline distribution.
    Online nonnegative CP-dictionary learning for Markovian data. (arXiv:2009.07612v3 [stat.ML] UPDATED)
    (2 min) Online Tensor Factorization (OTF) is a fundamental tool in learning low-dimensional interpretable features from streaming multi-modal data. While various algorithmic and theoretical aspects of OTF have been investigated recently, a general convergence guarantee to stationary points of the objective function without any incoherence or sparsity assumptions is still lacking even for the i.i.d. case. In this work, we introduce a novel algorithm that learns a CANDECOMP/PARAFAC (CP) basis from a given stream of tensor-valued data under general constraints, including nonnegativity constraints that induce interpretability of the learned CP basis. We prove that our algorithm converges almost surely to the set of stationary points of the objective function under the hypothesis that the sequence of data tensors is generated by an underlying Markov chain. Our setting covers the classical i.i.d. case as well as a wide range of application contexts including data streams generated by independent or MCMC sampling. Our result closes a gap between OTF and Online Matrix Factorization in global convergence analysis for CP-decompositions. Experimentally, we show that our algorithm converges much faster than standard algorithms for nonnegative tensor factorization tasks on both synthetic and real-world data. Also, we demonstrate the utility of our algorithm on a diverse set of examples from image, video, and time-series data, illustrating how one may learn qualitatively different CP-dictionaries from the same tensor data by exploiting the tensor structure in multiple ways.
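    For readers unfamiliar with nonnegative CP factorization, the snippet below runs the standard offline variant with TensorLy. The paper's contribution is the online, Markovian-data algorithm, which is not part of the library, so this is only a stand-in showing what a CP dictionary looks like; the tensor and rank are arbitrary.

        import numpy as np
        import tensorly as tl
        from tensorly.decomposition import non_negative_parafac

        # Offline nonnegative CP factorization of a random nonnegative tensor.
        X = tl.tensor(np.random.rand(30, 20, 10))
        weights, factors = non_negative_parafac(X, rank=5, n_iter_max=200)
        # One nonnegative factor matrix per mode; outer products of matching
        # columns form the rank-one dictionary atoms.
        print([f.shape for f in factors])   # [(30, 5), (20, 5), (10, 5)]
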
    Earthquake Nowcasting with Deep Learning. (arXiv:2201.01869v1 [physics.geo-ph])
    (2 min) We review previous approaches to nowcasting earthquakes and introduce new approaches based on deep learning using three distinct models based on recurrent neural networks and transformers. We discuss different choices for observables and measures, presenting promising initial results for a region of Southern California from 1950-2020. Earthquake activity is predicted as a function of 0.1-degree spatial bins for time periods varying from two weeks to four years. The overall quality is measured by the Nash-Sutcliffe Efficiency, which compares the deviation of nowcast from observation with the variance over time in each spatial region. The software is available as open-source together with the preprocessed data from the USGS.
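    The Nash-Sutcliffe Efficiency used above has a simple closed form: one minus the ratio of the squared nowcast error to the variance of the observations, so 1 is a perfect nowcast and 0 is no better than predicting the observed mean. A minimal implementation:

        import numpy as np

        def nse(observed, nowcast):
            # Nash-Sutcliffe Efficiency: 1 - SSE / variance of the observations.
            observed = np.asarray(observed, dtype=float)
            nowcast = np.asarray(nowcast, dtype=float)
            sse = np.sum((observed - nowcast) ** 2)
            var = np.sum((observed - observed.mean()) ** 2)
            return 1.0 - sse / var
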
    Fixed points of nonnegative neural networks. (arXiv:2106.16239v4 [stat.ML] UPDATED)
    (2 min) We derive conditions for the existence of fixed points of nonnegative neural networks, an important research objective for understanding the behavior of neural networks in modern applications involving autoencoders and loop unrolling techniques, among others. In particular, we show that neural networks with nonnegative inputs and nonnegative parameters can be recognized as monotonic and (weakly) scalable functions within the framework of nonlinear Perron-Frobenius theory. This fact enables us to derive conditions for the existence of a nonempty fixed point set of the nonnegative neural networks, and these conditions are weaker than those obtained recently using arguments in convex analysis, which are typically based on the assumption of nonexpansivity of the activation functions. Furthermore, we prove that the shape of the fixed point set of monotonic and weakly scalable neural networks is often an interval, which degenerates to a point for the case of scalable networks. The chief results of this paper are verified in numerical simulations, where we consider an autoencoder-type network that first compresses angular power spectra in massive MIMO systems and, second, reconstructs the input spectra from the compressed signals.
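    As a toy check of the phenomenon, the snippet below iterates a one-layer ReLU network with nonnegative weights from the origin; with the weights scaled so the map is contractive, the monotonically increasing iterates settle on a fixed point. This only illustrates the setting, not the paper's Perron-Frobenius proof technique, and the sizes and scaling are invented.

        import numpy as np

        rng = np.random.default_rng(0)
        W = 0.2 * rng.random((4, 4))      # nonnegative weights, scaled down
        b = rng.random(4)                 # nonnegative bias

        def f(x):
            # monotonic, weakly scalable map: relu(Wx + b) with W, b >= 0
            return np.maximum(W @ x + b, 0.0)

        x = np.zeros(4)
        for _ in range(200):
            x = f(x)
        print(np.allclose(x, f(x)))       # True: x is numerically a fixed point
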
    Efficient Vertex-Oriented Polytopic Projection for Web-scale Applications. (arXiv:2103.05277v3 [cs.AI] UPDATED)
    (0 min) We consider applications involving a large set of instances of projecting points to polytopes. We develop an intuition, guided by theoretical and empirical analysis, to show that when these instances follow certain structures, a large majority of the projections lie on vertices of the polytopes. To do these projections efficiently, we derive a vertex-oriented incremental algorithm to project a point onto any arbitrary polytope, as well as give specific algorithms to cater to simplex projection and polytopes where the unit box is cut by planes. Such settings are especially useful in web-scale applications such as optimal matching or allocation problems. Several such problems in internet marketplaces (e-commerce, ride-sharing, food delivery, professional services, advertising, etc.) can be formulated as Linear Programs (LP) with such polytope constraints that require a projection step in the overall optimization process. We show that in recent work, the polytopic projection is the most expensive step, and our efficient projection algorithms yield massive improvements in performance.
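    For the simplex special case mentioned above, the standard sort-based Euclidean projection (Duchi et al., 2008) is short enough to show in full; the paper's vertex-oriented algorithm for general polytopes is more involved and is not reproduced here.

        import numpy as np

        def project_simplex(v):
            # Euclidean projection of v onto {w : w >= 0, sum(w) = 1}, O(n log n).
            u = np.sort(v)[::-1]
            css = np.cumsum(u)
            j = np.arange(1, len(v) + 1)
            rho = np.nonzero(u + (1.0 - css) / j > 0)[0][-1]
            theta = (1.0 - css[rho]) / (rho + 1.0)
            return np.maximum(v + theta, 0.0)

        print(project_simplex(np.array([0.9, 0.8, -0.1])))  # -> [0.55 0.45 0.  ]
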
    Increasing the Confidence of Deep Neural Networks by Coverage Analysis. (arXiv:2101.12100v3 [cs.LG] UPDATED)
    (0 min) The great performance of machine learning algorithms and deep neural networks in several perception and control tasks is pushing the industry to adopt such technologies in safety-critical applications, such as autonomous robots and self-driving vehicles. At present, however, several issues need to be solved to make deep learning methods more trustworthy, predictable, safe, and secure against adversarial attacks. Although several methods have been proposed to improve the trustworthiness of deep neural networks, most of them are tailored for specific classes of adversarial examples, hence failing to detect other corner cases or unsafe inputs that heavily deviate from the training samples. This paper presents a lightweight monitoring architecture based on coverage paradigms to enhance the model robustness against different unsafe inputs. In particular, four coverage analysis methods are proposed and tested in the architecture for evaluating multiple detection logics. Experimental results show that the proposed approach is effective in detecting both powerful adversarial examples and out-of-distribution inputs, introducing limited extra execution time and memory requirements.
    Double Double Descent: On Generalization Errors in Transfer Learning between Linear Regression Tasks. (arXiv:2006.07002v7 [cs.LG] UPDATED)
    (0 min) We study the transfer learning process between two linear regression problems. An important and timely special case is when the regressors are overparameterized and perfectly interpolate their training data. We examine a parameter transfer mechanism whereby a subset of the parameters of the target task solution are constrained to the values learned for a related source task. We analytically characterize the generalization error of the target task in terms of the salient factors in the transfer learning architecture, i.e., the number of examples available, the number of (free) parameters in each of the tasks, the number of parameters transferred from the source to target task, and the relation between the two tasks. Our non-asymptotic analysis shows that the generalization error of the target task follows a two-dimensional double descent trend (with respect to the number of free parameters in each of the tasks) that is controlled by the transfer learning factors. Our analysis points to specific cases where the transfer of parameters is beneficial as a substitute for extra overparameterization (i.e., additional free parameters in the target task). Specifically, we show that the usefulness of a transfer learning setting is fragile and depends on a delicate interplay among the set of transferred parameters, the relation between the tasks, and the true solution. We also demonstrate that overparameterized transfer learning is not necessarily more beneficial when the source task is closer or identical to the target task.
    GRNN: Generative Regression Neural Network -- A Data Leakage Attack for Federated Learning. (arXiv:2105.00529v2 [cs.LG] UPDATED)
    (0 min) Data privacy has become an increasingly important issue in Machine Learning (ML), where many approaches have been developed to tackle this challenge, e.g. cryptography (Homomorphic Encryption (HE), Differential Privacy (DP), etc.) and collaborative training (Secure Multi-Party Computation (MPC), Distributed Learning and Federated Learning (FL)). These techniques have a particular focus on data encryption or secure local computation. They transfer the intermediate information to a third party to compute the final result. Gradient exchanging is commonly considered to be a secure way of training a robust model collaboratively in Deep Learning (DL). However, recent research has demonstrated that sensitive information can be recovered from the shared gradient. Generative Adversarial Networks (GANs), in particular, have been shown to be effective in recovering such information. However, GAN-based techniques require additional information, such as class labels, which are generally unavailable for privacy-preserved learning. In this paper, we show that, in the FL system, image-based privacy data can be easily recovered in full from the shared gradient only via our proposed Generative Regression Neural Network (GRNN). We formulate the attack as a regression problem and optimize two branches of the generative model by minimizing the distance between gradients. We evaluate our method on several image classification tasks. The results illustrate that our proposed GRNN outperforms state-of-the-art methods with better stability, stronger robustness, and higher accuracy. It also does not require the global FL model to have converged. Moreover, we demonstrate information leakage using face re-identification. Some defense strategies are also discussed in this work.
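    The core of such attacks is a gradient-matching objective. The sketch below shows the generic formulation (in the style of "deep leakage from gradients"): a dummy input/label pair is optimized so that the gradient it induces matches the gradient a client shared. Per the abstract, GRNN's twist is to produce the dummy data with two generative branches rather than optimizing it freely; that architecture is not reproduced here.

        import torch

        def gradient_matching_loss(model, loss_fn, dummy_x, dummy_y, shared_grads):
            # Distance between the gradient induced by (dummy_x, dummy_y)
            # and the gradient actually shared by the client.
            grads = torch.autograd.grad(loss_fn(model(dummy_x), dummy_y),
                                        model.parameters(), create_graph=True)
            return sum(((g - s) ** 2).sum() for g, s in zip(grads, shared_grads))
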
    Contrastive Active Inference. (arXiv:2110.10083v3 [cs.LG] UPDATED)
    (0 min) Active inference is a unifying theory for perception and action resting upon the idea that the brain maintains an internal model of the world by minimizing free energy. From a behavioral perspective, active inference agents can be seen as self-evidencing beings that act to fulfill their optimistic predictions, namely preferred outcomes or goals. In contrast, reinforcement learning requires human-designed rewards to accomplish any desired outcome. Although active inference could provide a more natural self-supervised objective for control, its applicability has been limited because of the shortcomings in scaling the approach to complex environments. In this work, we propose a contrastive objective for active inference that strongly reduces the computational burden in learning the agent's generative model and planning future actions. Our method performs notably better than likelihood-based active inference in image-based tasks, while also being computationally cheaper and easier to train. We compare to reinforcement learning agents that have access to human-designed reward functions, showing that our approach closely matches their performance. Finally, we also show that contrastive methods perform significantly better in the case of distractors in the environment and that our method is able to generalize goals to variations in the background.
    Efficient Estimation in NPIV Models: A Comparison of Various Neural Networks-Based Estimators. (arXiv:2110.06763v3 [econ.EM] UPDATED)
    (0 min) Artificial Neural Networks (ANNs) can be viewed as nonlinear sieves that can approximate complex functions of high dimensional variables more effectively than linear sieves. We investigate the computational performance of various ANNs in nonparametric instrumental variables (NPIV) models of moderately high dimensional covariates that are relevant to empirical economics. We present two efficient procedures for estimation and inference on a weighted average derivative (WAD): an orthogonalized plug-in with optimally-weighted sieve minimum distance (OP-OSMD) procedure and a sieve efficient score (ES) procedure. Both estimators for WAD use ANN sieves to approximate the unknown NPIV function and are root-n asymptotically normal and first-order equivalent. We provide a detailed practitioner's recipe for implementing both efficient procedures. This involves the choice of tuning parameters for the unknown NPIV, the conditional expectations and the optimal weighting function that are present in both procedures, but also the choice of tuning parameters for the unknown Riesz representer in the ES procedure. We compare their finite-sample performances in various simulation designs that involve smooth NPIV functions of up to 13 continuous covariates, different nonlinearities and covariate correlations. Some Monte Carlo findings include: 1) tuning and optimization are more delicate in ANN estimation; 2) given proper tuning, both ANN estimators with various architectures can perform well; 3) ANN OP-OSMD estimators are easier to tune than ANN ES estimators; 4) stable inferences are more difficult to achieve with ANN (than spline) estimators; 5) there are gaps between current implementations and approximation theories. Finally, we apply ANN NPIV to estimate average partial derivatives in two empirical demand examples with multivariate covariates.
    CFU Playground: Full-Stack Open-Source Framework for Tiny Machine Learning (tinyML) Acceleration on FPGAs. (arXiv:2201.01863v1 [cs.LG])
    (0 min) We present CFU Playground, a full-stack open-source framework that enables rapid and iterative design of machine learning (ML) accelerators for embedded ML systems. Our toolchain tightly integrates open-source software, RTL generators, and FPGA tools for synthesis, place, and route. This full-stack development framework gives engineers access to explore bespoke architectures that are customized and co-optimized for embedded ML. The rapid, deploy-profile-optimization feedback loop lets ML hardware and software developers achieve significant returns out of a relatively small investment in customization. Using CFU Playground's design loop, we show substantial speedups (55x-75x) and design space exploration between the CPU and accelerator.
    CausalSim: Toward a Causal Data-Driven Simulator for Network Protocols. (arXiv:2201.01811v1 [cs.LG])
    (2 min) Evaluating the real-world performance of network protocols is challenging. Randomized control trials (RCT) are expensive and inaccessible to most researchers, while expert-designed simulators fail to capture complex behaviors in real networks. We present CausalSim, a data-driven simulator for network protocols that addresses this challenge. Learning network behavior from observational data is complicated due to the bias introduced by the protocols used during data collection. CausalSim uses traces from an initial RCT under a set of protocols to learn a causal network model, effectively removing the biases present in the data. Using this model, CausalSim can then simulate any protocol over the same traces (i.e., for counterfactual predictions). Key to CausalSim is the novel use of adversarial neural network training that exploits distributional invariances that are present due to the training data coming from an RCT. Our extensive evaluation of CausalSim on both real and synthetic datasets and two use cases, including more than nine months of real data from the Puffer video streaming system, shows that it provides accurate counterfactual predictions, reducing prediction error by 44% and 53% on average compared to expert-designed and standard supervised learning baselines.
    Admissible Policy Teaching through Reward Design. (arXiv:2201.02185v1 [cs.LG])
    (2 min) We study reward design strategies for incentivizing a reinforcement learning agent to adopt a policy from a set of admissible policies. The goal of the reward designer is to modify the underlying reward function cost-efficiently while ensuring that any approximately optimal deterministic policy under the new reward function is admissible and performs well under the original reward function. This problem can be viewed as a dual to the problem of optimal reward poisoning attacks: instead of forcing an agent to adopt a specific policy, the reward designer incentivizes an agent to avoid taking actions that are inadmissible in certain states. Perhaps surprisingly, and in contrast to the problem of optimal reward poisoning attacks, we first show that the reward design problem for admissible policy teaching is computationally challenging, and it is NP-hard to find an approximately optimal reward modification. We then proceed by formulating a surrogate problem whose optimal solution approximates the optimal solution to the reward design problem in our setting, but is more amenable to optimization techniques and analysis. For this surrogate problem, we present characterization results that provide bounds on the value of the optimal solution. Finally, we design a local search algorithm to solve the surrogate problem and showcase its utility using simulation-based experiments.
    Planted Dense Subgraphs in Dense Random Graphs Can Be Recovered using Graph-based Machine Learning. (arXiv:2201.01825v1 [cs.LG])
    (2 min) Multiple methods of finding the vertices belonging to a planted dense subgraph in a random dense $G(n, p)$ graph have been proposed, with an emphasis on planted cliques. Such methods can identify the planted subgraph in polynomial time, but are limited to a few specific subgraph structures. Here, we present PYGON, a graph neural network-based algorithm, which is insensitive to the structure of the planted subgraph. This is the first algorithm that uses advanced learning tools for recovering dense subgraphs. We show that PYGON can recover cliques of sizes $\Theta\left(\sqrt{n}\right)$, where $n$ is the size of the background graph, comparable with the state of the art. We also show that the same algorithm can recover multiple other planted subgraphs of size $\Theta\left(\sqrt{n}\right)$, in both directed and undirected graphs. We conjecture that no polynomial-time PAC-learning algorithm can detect planted dense subgraphs of size smaller than $O\left(\sqrt{n}\right)$, even if in principle one could find dense subgraphs of logarithmic size.
    Graph Neural Networks for Double-Strand DNA Breaks Prediction. (arXiv:2201.01855v1 [q-bio.QM])
    (2 min) Double-strand DNA breaks (DSBs) are a form of DNA damage that can cause abnormal chromosomal rearrangements. Recent technologies based on high-throughput experiments have high costs and technical challenges. Therefore, we design a graph neural network-based method to predict DSBs (GraphDSB), using DNA sequence features and chromosome structure information. In order to improve the expression ability of the model, we introduce the Jumping Knowledge architecture and several effective structural encoding methods. The contribution of structural information to the prediction of DSBs is verified by experiments on datasets from normal human epidermal keratinocytes (NHEK) and the chronic myeloid leukemia cell line (K562), and ablation studies further demonstrate the effectiveness of the designed components in the proposed GraphDSB framework. Finally, we use GNNExplainer to analyze the contribution of node features and topology to DSB prediction, and demonstrate the high contribution of 5-mer DNA sequence features and two chromatin interaction modes.
    Elastic Product Quantization for Time Series. (arXiv:2201.01856v1 [cs.LG])
    (2 min) Analyzing numerous or long time series is difficult in practice due to the high storage costs and computational requirements. Therefore, techniques have been proposed to generate compact similarity-preserving representations of time series, enabling real-time similarity search on large in-memory data collections. However, the existing techniques are not ideally suited for assessing similarity when sequences are locally out of phase. In this paper, we propose the use of product quantization for efficient similarity-based comparison of time series under time warping. The idea is to first compress the data by partitioning the time series into equal length sub-sequences which are represented by a short code. The distance between two time series can then be efficiently approximated by pre-computed elastic distances between their codes. The partitioning into sub-sequences forces unwanted alignments, which we address with a pre-alignment step using the maximal overlap discrete wavelet transform (MODWT). To demonstrate the efficiency and accuracy of our method, we perform an extensive experimental evaluation on benchmark datasets in nearest neighbors classification and clustering applications. Overall, the proposed solution emerges as a highly efficient (both in terms of memory usage and computation time) replacement for elastic measures in time series applications.
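    A bare-bones product-quantization encoder makes the pipeline concrete: each series is cut into equal-length sub-sequences and each sub-sequence is replaced by the index of its nearest codeword. The paper's elastic twist, pre-computing DTW distances between codewords and the MODWT pre-alignment, is omitted here, and the sizes are illustrative.

        import numpy as np
        from sklearn.cluster import KMeans

        def fit_codebook(series, sub_len=25, n_codes=16):
            # series: (n_series, length), length divisible by sub_len,
            # with at least n_codes sub-sequences in total.
            subs = series.reshape(-1, sub_len)
            return KMeans(n_clusters=n_codes, n_init=10).fit(subs)

        def encode(series, codebook, sub_len=25):
            # Short code: one codeword index per sub-sequence.
            return codebook.predict(series.reshape(-1, sub_len))
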
    AIR-Net: Adaptive and Implicit Regularization Neural Network for Matrix Completion. (arXiv:2110.07557v3 [cs.LG] UPDATED)
    (2 min) The explicit low-rank regularization, e.g., nuclear norm regularization, has been widely used in imaging sciences. However, it has been found that implicit regularization outperforms explicit ones in various image processing tasks. Another issue is that the fixed explicit regularization limits the applicability to broad kinds of images since different images favor different features captured by using different explicit regularizations. As such, this paper proposes a new adaptive and implicit low-rank regularization that captures the low-rank prior dynamically from the training data. At the core of our new adaptive and implicit low-rank regularization is parameterizing the Laplacian matrix in the Dirichlet energy-based regularization with a neural network, and we call the proposed model AIR-Net. Theoretically, we show that the adaptive regularization of AIR-Net enhances the implicit regularization and vanishes at the end of training. We validate AIR-Net's effectiveness on various benchmark tasks, indicating that AIR-Net is particularly favorable for scenarios in which the missing entries are non-uniform. The code can be found at https://github.com/lizhemin15/AIR-Net.
    Evaluating and Mitigating Bias in Image Classifiers: A Causal Perspective Using Counterfactuals. (arXiv:2009.08270v4 [cs.CV] UPDATED)
    (2 min) Counterfactual examples for an input -- perturbations that change specific features but not others -- have been shown to be useful for evaluating bias of machine learning models, e.g., against specific demographic groups. However, generating counterfactual examples for images is non-trivial due to the underlying causal structure on the various features of an image. To be meaningful, generated perturbations need to satisfy constraints implied by the causal model. We present a method for generating counterfactuals by incorporating a structural causal model (SCM) in an improved variant of Adversarially Learned Inference (ALI), that generates counterfactuals in accordance with the causal relationships between attributes of an image. Based on the generated counterfactuals, we show how to explain a pre-trained machine learning classifier, evaluate its bias, and mitigate the bias using a counterfactual regularizer. On the Morpho-MNIST dataset, our method generates counterfactuals comparable in quality to prior work on SCM-based counterfactuals (DeepSCM), while on the more complex CelebA dataset our method outperforms DeepSCM in generating high-quality valid counterfactuals. Moreover, generated counterfactuals are indistinguishable from reconstructed images in a human evaluation experiment and we subsequently use them to evaluate the fairness of a standard classifier trained on CelebA data. We show that the classifier is biased w.r.t. skin and hair color, and how counterfactual regularization can remove those biases.
    Towards Conditional Generation of Minimal Action Potential Pathways for Molecular Dynamics. (arXiv:2111.14053v2 [q-bio.BM] UPDATED)
    (0 min) In this paper, we utilize generative models and reformulate them for problems in molecular dynamics (MD) simulation by introducing an MD potential energy component into our generative model. By incorporating potential energy as calculated from TorchMD into a conditional generative framework, we attempt to construct a low-potential-energy route of transformation between the helix $\rightarrow$ coil structures of a protein. We show how to add an additional loss function to conditional generative models, motivated by the potential energy of molecular configurations, and also present an optimization technique for such an augmented loss function. Our results show the benefit of this additional loss term on synthesizing realistic molecular trajectories.
    Integrating Human-in-the-loop into Swarm Learning for Decentralized Fake News Detection. (arXiv:2201.02048v1 [cs.LG])
    (2 min) Social media has become an effective platform to generate and spread fake news that can mislead people and even distort public opinion. Centralized methods for fake news detection, however, cannot effectively protect user privacy during the process of centralized data collection for training models. Moreover, they cannot fully involve user feedback in the loop of learning detection models for further enhancing fake news detection. To overcome these challenges, this paper proposes a novel decentralized method, Human-in-the-loop Based Swarm Learning (HBSL), to integrate user feedback into the loop of learning and inference for recognizing fake news without violating user privacy in a decentralized manner. It consists of distributed nodes that are able to independently learn and detect fake news on local data. Furthermore, detection models trained on these nodes can be enhanced through decentralized model merging. Experimental results demonstrate that the proposed method outperforms the state-of-the-art decentralized method at detecting fake news on a benchmark dataset.
    An Evaluation Study of Generative Adversarial Networks for Collaborative Filtering. (arXiv:2201.01815v1 [cs.IR])
    (2 min) This work explores the reproducibility of CFGAN. CFGAN and its family of models (TagRec, MTPR, and CRGAN) learn to generate personalized and fake-but-realistic rankings of preferences for top-N recommendations by using previous interactions. This work successfully replicates the results published in the original paper and discusses the impact of certain differences between the CFGAN framework and the model used in the original evaluation. The absence of random noise and the use of real user profiles as condition vectors leave the generator prone to learning a degenerate solution in which the output vector is identical to the input vector, therefore behaving essentially as a simple autoencoder. The work further expands the experimental analysis, comparing CFGAN against a selection of simple and well-known properly optimized baselines, and observing that CFGAN is not consistently competitive against them despite its high computational cost. To ensure the reproducibility of these analyses, this work describes the experimental methodology and publishes all datasets and source code.
    Does entity abstraction help generative Transformers reason? (arXiv:2201.01787v1 [cs.CL])
    (0 min) Pre-trained language models (LMs) often struggle to reason logically or generalize in a compositional fashion. Recent work suggests that incorporating external entity knowledge can improve LMs' abilities to reason and generalize. However, the effect of explicitly providing entity abstraction remains unclear, especially with recent studies suggesting that pre-trained LMs already encode some of that knowledge in their parameters. We study the utility of incorporating entity type abstractions into pre-trained Transformers and test these methods on four NLP tasks requiring different forms of logical reasoning: (1) compositional language understanding with text-based relational reasoning (CLUTRR), (2) abductive reasoning (ProofWriter), (3) multi-hop question answering (HotpotQA), and (4) conversational question answering (CoQA). We propose and empirically explore three ways to add such abstraction: (i) as additional input embeddings, (ii) as a separate sequence to encode, and (iii) as an auxiliary prediction task for the model. Overall, our analysis demonstrates that models with abstract entity knowledge perform better than those without it. However, our experiments also show that the benefits strongly depend on the technique used and the task at hand. The best abstraction-aware models achieved an overall accuracy of 88.8% and 91.8%, compared to the baseline model's 62.3% and 89.8%, on CLUTRR and ProofWriter respectively. In addition, abstraction-aware models showed improved compositional generalization in both interpolation and extrapolation settings. However, for HotpotQA and CoQA, we find that F1 scores improve by only 0.5% on average. Our results suggest that the benefit of explicit abstraction is significant in formally defined logical reasoning settings requiring many reasoning hops, but less so for NLP tasks having less formal logical structure.
    Federated Learning of Molecular Properties with Graph Neural Networks in a Heterogeneous Setting. (arXiv:2109.07258v2 [cs.LG] UPDATED)
    (2 min) Chemistry research has both high material and computational costs to conduct experiments. Institutions thus consider chemical data to be valuable, and there have been few efforts to construct large public datasets for machine learning. Another challenge is that different institutions are interested in different classes of molecules, creating heterogeneous data that cannot be easily joined by conventional distributed training. In this work, we introduce federated heterogeneous molecular learning to address these challenges. Federated learning allows end-users to build a global model collaboratively while keeping the training data distributed over isolated clients. Due to the lack of related research, we first simulate a heterogeneous federated learning benchmark (FedChem) by jointly performing scaffold splitting and latent Dirichlet allocation on existing datasets for heterogeneously distributed client data. Our results on FedChem show that significant learning challenges arise when working with heterogeneous molecules across clients. We then propose a method to alleviate the problem, namely Federated Learning by Instance reweighTing (FLIT(+)). FLIT(+) can align the local training across heterogeneous clients by improving the performance for uncertain samples. Comprehensive experiments conducted on our new benchmark FedChem validate the advantages of this method over other federated learning schemes. FedChem should enable a new type of collaboration for improving AI in chemistry that mitigates concerns about valuable chemical data.
    Self-Supervised Beat Tracking in Musical Signals with Polyphonic Contrastive Learning. (arXiv:2201.01771v1 [cs.SD])
    (2 min) Annotating musical beats is a long and tedious process. In order to combat this problem, we present a new self-supervised learning pretext task for beat tracking and downbeat estimation. This task makes use of Spleeter, an audio source separation model, to separate a song's drums from the rest of its signal. The drum signals are used as positives, and by extension negatives, for contrastive learning pre-training. The drum-less signals, on the other hand, are used as anchors. When pre-training a fully-convolutional and recurrent model using this pretext task, an onset function is learned. In some cases, this function was found to be mapped to periodic elements in a song. We found that pre-trained models outperformed randomly initialized models when the beat tracking training set was extremely small (less than 10 examples). When that was not the case, pre-training led to a learning speed-up that caused the model to overfit to the training set. More generally, this work defines new perspectives in the realm of musical self-supervised learning. It is notably one of the first works to use audio source separation as a fundamental component of self-supervision.
    An Abstraction-Refinement Approach to Verifying Convolutional Neural Networks. (arXiv:2201.01978v1 [cs.LG])
    (2 min) Convolutional neural networks have gained vast popularity due to their excellent performance in the fields of computer vision, image processing, and others. Unfortunately, it is now well known that convolutional networks often produce erroneous results - for example, minor perturbations of the inputs of these networks can result in severe classification errors. Numerous verification approaches have been proposed in recent years to prove the absence of such errors, but these are typically geared for fully connected networks and suffer from exacerbated scalability issues when applied to convolutional networks. To address this gap, we present here the Cnn-Abs framework, which is particularly aimed at the verification of convolutional networks. The core of Cnn-Abs is an abstraction-refinement technique, which simplifies the verification problem through the removal of convolutional connections in a way that soundly creates an over-approximation of the original problem; and which restores these connections if the resulting problem becomes too abstract. Cnn-Abs is designed to use existing verification engines as a backend, and our evaluation demonstrates that it can significantly boost the performance of a state-of-the-art DNN verification engine, reducing runtime by 15.7% on average.
    Deep Learning Based Classification System For Recognizing Local Spinach. (arXiv:2201.02093v1 [cs.CV])
    (2 min) A deep learning model can give impressive results for image processing by learning from a training dataset. Spinach is a leaf vegetable that contains vitamins and nutrients. In our research, a deep learning method is used to automatically identify spinach, with a dataset covering five species of spinach and containing 3785 images. Four Convolutional Neural Network (CNN) models were used to classify our spinach; these models give accurate results for image classification. Before applying these models, the image data are preprocessed in several steps: RGB conversion, filtering, resizing and rescaling, and categorization. After these steps, the image data are ready to be used in the classifier algorithms. The accuracy of these classifiers lies between 98.68% and 99.79%; among the models, VGG16 achieved the highest accuracy of 99.79%.
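    A hedged sketch of the preprocessing steps listed above (RGB conversion, resize and rescale, one-hot categorization); the image size matches VGG16's default input, and the paths and five-class label count follow the abstract rather than any released code.

        import tensorflow as tf

        IMG_SIZE = (224, 224)          # VGG16 input resolution
        NUM_CLASSES = 5                # five spinach species

        def preprocess(path, label):
            img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)   # RGB
            img = tf.image.resize(img, IMG_SIZE) / 255.0    # resize and rescale
            return img, tf.one_hot(label, NUM_CLASSES)      # categorization
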
    Topological Representations of Local Explanations. (arXiv:2201.02155v1 [cs.LG])
    (2 min) Local explainability methods -- those which seek to generate an explanation for each prediction -- are becoming increasingly prevalent due to the need for practitioners to rationalize their model outputs. However, comparing local explainability methods is difficult since they each generate outputs in various scales and dimensions. Furthermore, due to the stochastic nature of some explainability methods, it is possible for different runs of a method to produce contradictory explanations for a given observation. In this paper, we propose a topology-based framework to extract a simplified representation from a set of local explanations. We do so by first modeling the relationship between the explanation space and the model predictions as a scalar function. Then, we compute the topological skeleton of this function. This topological skeleton acts as a signature for such functions, which we use to compare different explanation methods. We demonstrate that our framework can not only reliably identify differences between explainability techniques but also provides stable representations. Then, we show how our framework can be used to identify appropriate parameters for local explainability methods. Our framework is simple, does not require complex optimizations, and can be broadly applied to most local explanation methods. We believe the practicality and versatility of our approach will help promote topology-based approaches as a tool for understanding and comparing explanation methods.
    A Hybrid Quantum-Classical Neural Network Architecture for Binary Classification. (arXiv:2201.01820v1 [cs.LG])
    (2 min) Deep learning is one of the most successful and far-reaching strategies used in machine learning today. However, the scale and utility of neural networks is still greatly limited by the current hardware used to train them. These concerns have become increasingly pressing as conventional computers quickly approach physical limitations that will slow performance improvements in years to come. For these reasons, scientists have begun to explore alternative computing platforms, like quantum computers, for training neural networks. In recent years, variational quantum circuits have emerged as one of the most successful approaches to quantum deep learning on noisy intermediate-scale quantum devices. We propose a hybrid quantum-classical neural network architecture where each neuron is a variational quantum circuit. We empirically analyze the performance of this hybrid neural network on a series of binary classification data sets using a simulated universal quantum computer and a state-of-the-art universal quantum computer. On simulated hardware, we observe that the hybrid neural network achieves roughly 10% higher classification accuracy and 20% better minimization of cost than an individual variational quantum circuit. On quantum hardware, we observe that each model only performs well when the qubit and gate count is sufficiently small.
    Mixture of basis for interpretable continual learning with distribution shifts. (arXiv:2201.01853v1 [cs.LG])
    (2 min) Continual learning in environments with shifting data distributions is a challenging problem with several real-world applications. In this paper we consider settings in which the data distribution (task) shifts abruptly and the timing of these shifts is not known. Furthermore, we consider a semi-supervised task-agnostic setting in which the learning algorithm has access to both task-segmented and unsegmented data for offline training. We propose a novel approach called Mixture of Basis models (MoB) for addressing this problem setting. The core idea is to learn a small set of basis models and to construct a dynamic, task-dependent mixture of the models to predict for the current task. We also propose a new methodology to detect observations that are out-of-distribution with respect to the existing basis models and to instantiate new models as needed. We test our approach in multiple domains and show that it attains better prediction error than existing methods in most cases while using fewer models than other multiple-model approaches. Moreover, we analyze the latent task representations learned by MoB and show that similar tasks tend to cluster in the latent space and that the latent representation shifts at the task boundaries when tasks are dissimilar.
    Introducing Randomized High Order Fuzzy Cognitive Maps as Reservoir Computing Models: A Case Study in Solar Energy and Load Forecasting. (arXiv:2201.02158v1 [cs.AI])
    (2 min) Fuzzy Cognitive Maps (FCMs) have emerged as an interpretable signed weighted digraph method consisting of nodes (concepts) and weights which represent the dependencies among the concepts. Although FCMs have achieved considerable success in various time series prediction applications, designing an FCM model with a time-efficient training method is still an open challenge. Thus, this paper introduces a novel univariate time series forecasting technique, which is composed of a group of randomized high order FCM models labeled R-HFCM. The novelty of the proposed R-HFCM model lies in merging the concepts of FCM and Echo State Networks (ESN) as an efficient and particular family of Reservoir Computing (RC) models, where the least squares algorithm is applied to train the model. From another perspective, the structure of R-HFCM consists of an input layer, a reservoir layer, and an output layer, in which only the output layer is trainable, while the weights of each sub-reservoir component are selected randomly and kept constant during the training process. As case studies, this model considers solar energy forecasting with public data for Brazilian solar stations as well as a Malaysia dataset, which includes hourly electric load and temperature data of the power supply company of the city of Johor in Malaysia. The experiments also cover the effect of the map size, activation function, presence of bias and size of the reservoir on the accuracy of the R-HFCM method. The obtained results confirm that the proposed R-HFCM model outperforms the other methods. This study provides evidence that FCMs can be a new way to implement a reservoir of dynamics in time series modelling.
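    The training scheme described above (random fixed reservoir, least-squares fit of the readout only) is easy to demonstrate with a generic echo-state network on a toy series; note this sketch uses a plain tanh reservoir, not the FCM-based sub-reservoirs of R-HFCM, and all sizes are arbitrary.

        import numpy as np

        rng = np.random.default_rng(0)
        n_res, T = 50, 500
        W_in = rng.normal(size=n_res)                 # fixed input weights
        W = rng.normal(size=(n_res, n_res))           # fixed recurrent weights
        W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # echo-state scaling

        u = np.sin(np.linspace(0.0, 20.0, T + 1))     # toy univariate series
        states = np.zeros((T, n_res))
        x = np.zeros(n_res)
        for t in range(T):
            x = np.tanh(W @ x + W_in * u[t])          # untrained reservoir update
            states[t] = x

        # Only the linear readout is trained, by ordinary least squares.
        readout, *_ = np.linalg.lstsq(states, u[1:], rcond=None)
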
    Deep Fusion of Lead-lag Graphs: Application to Cryptocurrencies. (arXiv:2201.02040v1 [cs.LG])
    (2 min) The study of time series has motivated many researchers, particularly in the area of multivariate analysis. The study of co-movements and dependency between random variables leads us to develop metrics to describe existing connections between assets. The most commonly used are correlation and causality. Despite the growing literature, some connections remain undetected. The objective of this paper is to propose a new representation learning algorithm capable of integrating synchronous and asynchronous relationships.
    Direct multi-modal inversion of geophysical logs using deep learning. (arXiv:2201.01871v1 [physics.geo-ph])
    (2 min) Geosteering of wells requires fast interpretation of geophysical logs, which is a non-unique inverse problem. This work presents a proof-of-concept approach to multi-modal probabilistic inversion of logs using a single evaluation of an artificial deep neural network (DNN). A mixture density DNN (MDN) is trained using the "multiple-trajectory-prediction" (MTP) loss function, which avoids the mode collapse typical for traditional MDNs and allows multi-modal prediction ahead of data. The proposed approach is verified on the real-time stratigraphic inversion of gamma-ray logs. The multi-modal predictor outputs several likely inverse solutions/predictions, providing more accurate and realistic solutions than a deterministic regression using a DNN. For these likely stratigraphic curves, the model simultaneously predicts their probabilities, which are implicitly learned from the training geological data. The stratigraphy predictions and their probabilities obtained in milliseconds from the MDN can enable better real-time decisions under geological uncertainties.
    Treehouse: A Case For Carbon-Aware Datacenter Software. (arXiv:2201.02120v1 [cs.DC])
    (2 min) The end of Dennard scaling and the slowing of Moore's Law has put the energy use of datacenters on an unsustainable path. Datacenters are already a significant fraction of worldwide electricity use, with application demand scaling at a rapid rate. We argue that substantial reductions in the carbon intensity of datacenter computing are possible with a software-centric approach: by making energy and carbon visible to application developers on a fine-grained basis, by modifying system APIs to make it possible to make informed trade-offs between performance and carbon emissions, and by raising the level of application programming to allow for flexible use of more energy-efficient means of compute and storage. We also lay out a research agenda for systems software to reduce the carbon footprint of datacenter computing.
    Point-Set Kernel Clustering. (arXiv:2002.05815v2 [cs.LG] UPDATED)
    (2 min) Measuring similarity between two objects is the core operation by which existing clustering algorithms group similar objects into clusters. This paper introduces a new similarity measure called the point-set kernel, which computes the similarity between an object and a set of objects. The proposed clustering procedure utilizes this new measure to characterize every cluster grown from a seed object. We show that the new clustering procedure is both effective and efficient, enabling it to deal with large-scale datasets. In contrast, existing clustering algorithms are either efficient or effective. In comparison with the state-of-the-art density-peak clustering and scalable kernel k-means clustering, we show that the proposed algorithm is more effective and runs orders of magnitude faster when applied to datasets of millions of data points, on a commonly used computing machine.
    DeepMLS: Geometry-Aware Control Point Deformation. (arXiv:2201.01873v1 [cs.GR])
    (2 min) We introduce DeepMLS, a space-based deformation technique, guided by a set of displaced control points. We leverage the power of neural networks to inject the underlying shape geometry into the deformation parameters. The goal of our technique is to enable a realistic and intuitive shape deformation. Our method is built upon moving least-squares (MLS), since it minimizes a weighted sum of the given control point displacements. Traditionally, the influence of each control point on every point in space (i.e., the weighting function) is defined using inverse distance heuristics. In this work, we opt to learn the weighting function, by training a neural network on the control points from a single input shape, and exploit the innate smoothness of neural networks. Our geometry-aware control point deformation is agnostic to the surface representation and quality; it can be applied to point clouds or meshes, including non-manifold and disconnected surface soups. We show that our technique facilitates intuitive piecewise smooth deformations, which are well suited for manufactured objects. We show the advantages of our approach compared to existing surface and space-based deformation techniques, both quantitatively and qualitatively.
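    The inverse-distance heuristic that DeepMLS replaces with a learned weighting function looks roughly like the sketch below: every space point is displaced by a normalized, distance-weighted average of the control-point displacements. This is a simplified Shepard-style baseline for intuition, not the paper's method, and the parameter names are invented.

        import numpy as np

        def mls_displace(x, controls, displacements, power=2.0, eps=1e-8):
            # x: (3,) query point; controls, displacements: (k, 3) arrays.
            d = np.linalg.norm(controls - x, axis=1)
            w = 1.0 / (d ** power + eps)        # inverse-distance weights
            w /= w.sum()
            return x + w @ displacements        # weighted sum of displacements
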
    Efficiently Disentangle Causal Representations. (arXiv:2201.01942v1 [cs.LG])
    (0 min) This paper proposes an efficient approach to learning disentangled representations with causal mechanisms, based on the difference of conditional probabilities in original and new distributions. We approximate the difference with models' generalization abilities so that it fits in the standard machine learning framework and can be efficiently computed. In contrast to the state-of-the-art approach, which relies on the learner's adaptation speed to new distribution, the proposed approach only requires evaluating the model's generalization ability. We provide a theoretical explanation for the advantage of the proposed method, and our experiments show that the proposed technique is 1.9--11.0$\times$ more sample efficient and 9.4--32.4$\times$ quicker than the previous method on various tasks. The source code is available at https://github.com/yuanpeng16/EDCR.
    Learning Optimal Antenna Tilt Control Policies: A Contextual Linear Bandit Approach. (arXiv:2201.02169v1 [cs.LG])
    (2 min) Controlling antenna tilts in cellular networks is imperative to reach an efficient trade-off between network coverage and capacity. In this paper, we devise algorithms learning optimal tilt control policies from existing data (in the so-called passive learning setting) or from data actively generated by the algorithms (the active learning setting). We formalize the design of such algorithms as a Best Policy Identification (BPI) problem in Contextual Linear Multi-Arm Bandits (CL-MAB). An arm represents an antenna tilt update; the context captures current network conditions; the reward corresponds to an improvement of performance, mixing coverage and capacity; and the objective is to identify, with a given level of confidence, an approximately optimal policy (a function mapping the context to an arm with maximal reward). For CL-MAB in both active and passive learning settings, we derive information-theoretical lower bounds on the number of samples required by any algorithm returning an approximately optimal policy with a given level of certainty, and devise algorithms achieving these fundamental limits. We apply our algorithms to the Remote Electrical Tilt (RET) optimization problem in cellular networks, and show that they can produce an optimal tilt update policy using far fewer data samples than naive or existing rule-based learning algorithms.
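    To ground the CL-MAB vocabulary, here is a minimal per-arm linear bandit model in the LinUCB style: each tilt update (arm) keeps a ridge-regression estimate of reward given context and is scored optimistically. The paper's BPI algorithms differ, targeting identification with fixed confidence rather than regret, so treat this purely as generic background.

        import numpy as np

        class LinUCBArm:
            def __init__(self, dim, alpha=1.0):
                self.A = np.eye(dim)            # ridge-regularized Gram matrix
                self.b = np.zeros(dim)
                self.alpha = alpha              # exploration strength

            def ucb(self, context):
                A_inv = np.linalg.inv(self.A)
                theta = A_inv @ self.b          # reward-model estimate
                bonus = self.alpha * np.sqrt(context @ A_inv @ context)
                return theta @ context + bonus  # optimistic score

            def update(self, context, reward):
                self.A += np.outer(context, context)
                self.b += reward * context
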
    NumHTML: Numeric-Oriented Hierarchical Transformer Model for Multi-task Financial Forecasting. (arXiv:2201.01770v1 [cs.LG])
    (0 min) Financial forecasting has been an important and active area of machine learning research because of the challenges it presents and the potential rewards that even minor improvements in prediction accuracy or forecasting may entail. Traditionally, financial forecasting has heavily relied on quantitative indicators and metrics derived from structured financial statements. Earnings conference call data, including text and audio, is an important source of unstructured data that has been used for various prediction tasks using deep learning and related approaches. However, current deep learning-based methods are limited in the way that they deal with numeric data; numbers are typically treated as plain-text tokens without taking advantage of their underlying numeric structure. This paper describes a numeric-oriented hierarchical transformer model to predict stock returns and financial risk using multi-modal aligned earnings call data by taking advantage of the different categories of numbers (monetary, temporal, percentages, etc.) and their magnitude. We present the results of a comprehensive evaluation of NumHTML against several state-of-the-art baselines using a real-world publicly available dataset. The results indicate that NumHTML significantly outperforms the current state-of-the-art across a variety of evaluation metrics and that it has the potential to offer significant financial gains in a practical trading context.
    Learning Visible Connectivity Dynamics for Cloth Smoothing. (arXiv:2105.10389v4 [cs.RO] UPDATED)
    (2 min) Robotic manipulation of cloth remains challenging due to the complex dynamics of cloth, the lack of a low-dimensional state representation, and self-occlusions. In contrast to previous model-based approaches that learn a pixel-based dynamics model or a compressed latent vector dynamics, we propose to learn a particle-based dynamics model from a partial point cloud observation. To overcome the challenges of partial observability, we infer which visible points are connected on the underlying cloth mesh. We then learn a dynamics model over this visible connectivity graph. Compared to previous learning-based approaches, our model poses a strong inductive bias with its particle-based representation for learning the underlying cloth physics; it is invariant to visual features; and its predictions can be more easily visualized. We show that our method greatly outperforms previous state-of-the-art model-based and model-free reinforcement learning methods in simulation. Furthermore, we demonstrate zero-shot sim-to-real transfer, where we deploy the model trained in simulation on a Franka arm and show that the model can successfully smooth different types of cloth from crumpled configurations. Videos can be found on our project website.
    Explainable AI for engineering design: A unified approach of systems engineering and component-based deep learning. (arXiv:2108.13836v2 [cs.LG] UPDATED)
    (2 min) Data-driven models created by machine learning are gaining importance in all fields of design and engineering. They have high potential to assist decision-makers in creating novel artefacts with better performance and sustainability. However, limited generalization and the black-box nature of these models induce limited explainability and reusability. These drawbacks pose significant barriers to adoption in engineering design. To overcome this situation, we propose a component-based approach to create partial component models by machine learning (ML). This component-based approach aligns deep learning with systems engineering (SE). By means of the example of energy-efficient building design, we first demonstrate generalization of the component-based method by accurately predicting the performance of designs with random structure different from the training data. Second, we illustrate explainability by local sampling, sensitivity information and rules derived from low-depth decision trees, and by evaluating this information from an engineering design perspective. The key to explainability is that activations at interfaces between the components are interpretable engineering quantities. In this way, the hierarchical component system forms a deep neural network (DNN) that directly integrates information for engineering explainability. The large range of possible configurations in composing components allows the examination of novel unseen design cases with understandable data-driven models. The matching of parameter ranges of components by similar probability distributions produces reusable, well-generalizing, and trustworthy models. The approach adapts the model structure to engineering methods of systems engineering and domain knowledge.
    Combining Reinforcement Learning and Inverse Reinforcement Learning for Asset Allocation Recommendations. (arXiv:2201.01874v1 [cs.LG])
    (2 min) We suggest a simple practical method to combine human and artificial intelligence to both learn the best investment practices of fund managers and provide recommendations to improve them. Our approach is based on a combination of Inverse Reinforcement Learning (IRL) and RL. First, the IRL component learns the intent of fund managers as suggested by their trading history and recovers their implied reward function. In the second step, this reward function is used by a direct RL algorithm to optimize asset allocation decisions. We show that our method is able to improve over the performance of individual fund managers.
    Uncertainty Quantification Techniques for Space Weather Modeling: Thermospheric Density Application. (arXiv:2201.02067v1 [cs.LG])
    (2 min) Machine learning (ML) has often been applied to space weather (SW) problems in recent years. SW originates from solar perturbations and comprises the resulting complex variations they cause within the systems between the Sun and Earth. These systems are tightly coupled and not well understood. This creates a need for skillful models with knowledge about the confidence of their predictions. One example of such a dynamical system is the thermosphere, the neutral region of Earth's upper atmosphere. Our inability to forecast it has severe repercussions in the context of satellite drag and collision avoidance operations for objects in low Earth orbit. Even with (assumed) perfect driver forecasts, our incomplete knowledge of the system results in often inaccurate neutral mass density predictions. Continuing efforts are being made to improve model accuracy, but density models rarely provide estimates of uncertainty. In this work, we propose two techniques to develop nonlinear ML models to predict thermospheric density while providing calibrated uncertainty estimates: Monte Carlo (MC) dropout and direct prediction of the probability distribution, both using the negative logarithm of predictive density (NLPD) loss function. We show the performance for models trained on local and global datasets. The two techniques provide similar results under the NLPD loss, but the direct probability method has a much lower computational cost. For the global model regressed on the SET HASDM density database, we achieve errors of 11% on independent test data with well-calibrated uncertainty estimates. Using an in-situ CHAMP density dataset, both techniques provide test errors on the order of 13%. The CHAMP models (on independent data) are within 2% of perfect calibration for all prediction intervals tested. This model can also be used to obtain global predictions with uncertainties at a given epoch.
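    Of the two techniques named above, MC dropout is the easier to sketch: dropout is left active at inference, and the spread of repeated stochastic forward passes is read as predictive uncertainty. The few lines below assume a generic PyTorch regression model containing dropout layers; the direct-probability alternative would instead have the network output a mean and variance trained with a Gaussian NLPD loss.

        import torch

        def mc_dropout_predict(model, x, n_samples=100):
            model.train()                      # keeps dropout layers stochastic
            with torch.no_grad():
                preds = torch.stack([model(x) for _ in range(n_samples)])
            return preds.mean(dim=0), preds.std(dim=0)   # mean and uncertainty
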
    Multi-Label Classification on Remote-Sensing Images. (arXiv:2201.01971v1 [cs.CV])
    (2 min) Acquiring information on large areas of the earth's surface through satellite cameras allows us to see much more than we can see while standing on the ground. This assists us in detecting and monitoring the physical characteristics of an area such as land-use patterns, atmospheric conditions, forest cover, and many unlisted aspects. The obtained images not only keep track of continuous natural phenomena but are also crucial in tackling the global challenge of severe deforestation, of which the Amazon basin accounts for the largest share every year. Proper data analysis would help limit detrimental effects on the ecosystem and biodiversity and sustain a healthy atmosphere. This report aims to label satellite image chips of the Amazon rainforest with atmospheric conditions and various classes of land cover or land use through different machine learning and superior deep learning models. Evaluation is based on the F2 metric, while for the loss function we use both sigmoid cross-entropy and softmax cross-entropy. Images are fed indirectly to the machine learning classifiers: features are first extracted using pre-trained ImageNet architectures. For the deep learning models, ensembles of fine-tuned ImageNet pre-trained models are used via transfer learning. Our best F2 score so far is 0.927.
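    For reference, the F2 metric is the F-beta score with beta = 2, weighting recall twice as heavily as precision; for multi-label chips it is typically averaged per sample. A toy computation with scikit-learn (labels invented for illustration):

        import numpy as np
        from sklearn.metrics import fbeta_score

        y_true = np.array([[1, 0, 1], [0, 1, 1]])   # two chips, three tags
        y_pred = np.array([[1, 0, 0], [0, 1, 1]])
        print(fbeta_score(y_true, y_pred, beta=2, average='samples'))
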
    SABLAS: Learning Safe Control for Black-box Dynamical Systems. (arXiv:2201.01918v1 [cs.LG])
    (2 min) Control certificates based on barrier functions have been a powerful tool to generate provably safe control policies for dynamical systems. However, existing methods based on barrier certificates are normally for white-box systems with differentiable dynamics, which makes them inapplicable to many practical applications where the system is a black-box and cannot be accurately modeled. On the other hand, model-free reinforcement learning (RL) methods for black-box systems suffer from a lack of safety guarantees and low sampling efficiency. In this paper, we propose a novel method that can learn safe control policies and barrier certificates for black-box dynamical systems, without requiring an accurate system model. Our method re-designs the loss function to back-propagate gradients to the control policy even when the black-box dynamical system is non-differentiable, and we show that the safety certificates hold on the black-box system. Empirical results in simulation show that our method can significantly improve the performance of the learned policies by achieving nearly 100% safety and goal reaching rates using much fewer training samples, compared to state-of-the-art black-box safe control methods. Our learned agents can also generalize to unseen scenarios while keeping the original performance. The source code can be found at https://github.com/Zengyi-Qin/bcbf.
    Neural Architecture Search for Inversion. (arXiv:2201.01772v1 [cs.LG])
    (2 min) Over the years, people have been using deep learning to tackle inversion problems, and the framework has been applied to build a relationship between the recorded wavefield and velocity (Yang et al., 2016). Here we extend that work in two directions. One is deriving a more appropriate loss function: as we know, pixel-to-pixel comparison might not be the best choice to characterize image structure, and we elaborate on how to construct a cost function that captures high-level features to enhance model performance. The other is searching for a more appropriate neural architecture, which is a subset of an even bigger picture, automatic machine learning (AutoML). There are several famous networks, U-net, ResNet (He et al., 2016) and DenseNet (Huang et al., 2017), and they achieve phenomenal results for certain problems, yet it is hard to argue they are the best for inversion problems without thoroughly searching within a given space. Here we show our architecture search results for inversion.
  • stat.ML updates on arXiv.org

    Machine Learning for Variance Reduction in Online Experiments. (arXiv:2106.07263v3 [stat.ML] UPDATED)
    (0 min) We consider the problem of variance reduction in randomized controlled trials, through the use of covariates correlated with the outcome but independent of the treatment. We propose a machine learning regression-adjusted treatment effect estimator, which we call MLRATE. MLRATE uses machine learning predictors of the outcome to reduce estimator variance. It employs cross-fitting to avoid overfitting biases, and we prove consistency and asymptotic normality under general conditions. MLRATE is robust to poor predictions from the machine learning step: if the predictions are uncorrelated with the outcomes, the estimator performs asymptotically no worse than the standard difference-in-means estimator, while if predictions are highly correlated with outcomes, the efficiency gains are large. In A/A tests, for a set of 48 outcome metrics commonly monitored in Facebook experiments, the estimator has over 70% lower variance than the simple difference-in-means estimator, and about 19% lower variance than the common univariate procedure which adjusts only for pre-experiment values of the outcome.
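    The cross-fitting idea can be sketched in a few lines; this is a simplified residualized difference-in-means in the same spirit as MLRATE, not the exact estimator (which additionally runs a regression with treatment interactions):

        # Sketch: cross-fitted regression adjustment for a randomized experiment.
        # X = pre-experiment covariates, y = outcome, t = 0/1 treatment indicator.
        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor
        from sklearn.model_selection import KFold

        def regression_adjusted_ate(X, y, t, n_folds=5):
            g = np.zeros_like(y, dtype=float)        # out-of-fold predictions
            for tr, te in KFold(n_folds, shuffle=True, random_state=0).split(X):
                model = GradientBoostingRegressor().fit(X[tr], y[tr])
                g[te] = model.predict(X[te])         # cross-fitting avoids overfit bias
            resid = y - g                            # variance-reduced outcome
            return resid[t == 1].mean() - resid[t == 0].mean()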
    Point-Set Kernel Clustering. (arXiv:2002.05815v2 [cs.LG] UPDATED)
    (2 min) Measuring similarity between two objects is the core operation in existing clustering algorithms for grouping similar objects into clusters. This paper introduces a new similarity measure called the point-set kernel, which computes the similarity between an object and a set of objects. The proposed clustering procedure utilizes this new measure to characterize every cluster grown from a seed object. We show that the new clustering procedure is both effective and efficient, enabling it to deal with large-scale datasets. In contrast, existing clustering algorithms are either efficient or effective. In comparison with the state-of-the-art density-peak clustering and scalable kernel k-means clustering, we show that the proposed algorithm is more effective and runs orders of magnitude faster when applied to datasets of millions of data points, on a commonly used computing machine.
    HuSpaCy: an industrial-strength Hungarian natural language processing toolkit. (arXiv:2201.01956v1 [cs.CL])
    (2 min) Although there are a couple of open-source language processing pipelines available for Hungarian, none of them satisfies the requirements of today's NLP applications. A language processing pipeline should consist of close to state-of-the-art lemmatization, morphosyntactic analysis, entity recognition and word embeddings. Industrial text processing applications have to satisfy non-functional software quality requirements; what is more, frameworks supporting multiple languages are increasingly favored. This paper introduces HuSpaCy, an industry-ready Hungarian language processing pipeline. The presented tool provides components for the most important basic linguistic analysis tasks. It is open-source and is available under a permissive license. Our system is built upon spaCy's NLP components, which means that it is fast, has a rich ecosystem of NLP applications and extensions, comes with extensive documentation and a well-known API. Besides the overview of the underlying models, we also present rigorous evaluation on common benchmark datasets. Our experiments confirm that HuSpaCy has high accuracy in all subtasks while maintaining resource-efficient prediction capabilities.
    Randomized Spectral Clustering in Large-Scale Stochastic Block Models. (arXiv:2002.00839v3 [cs.SI] UPDATED)
    (2 min) Spectral clustering has been one of the widely used methods for community detection in networks. However, large-scale networks bring computational challenges to the eigenvalue decomposition therein. In this paper, we study spectral clustering using randomized sketching algorithms from a statistical perspective, where we typically assume the network data are generated from a stochastic block model that is not necessarily of full rank. To do this, we first use the recently developed sketching algorithms to obtain two randomized spectral clustering algorithms, namely, the random projection-based and the random sampling-based spectral clustering. Then we study the theoretical bounds of the resulting algorithms in terms of the approximation error for the population adjacency matrix, the misclassification error, and the estimation error for the link probability matrix. It turns out that, under mild conditions, the randomized spectral clustering algorithms lead to the same theoretical bounds as those of the original spectral clustering algorithm. We also extend the results to degree-corrected stochastic block models. Numerical experiments support our theoretical findings and show the efficiency of randomized methods. A new R package called Rclust is developed and made available to the public.
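    A generic version of the random projection variant is easy to sketch with a randomized range finder; this is an illustrative simplification, not the authors' Rclust implementation:

        # Sketch: randomized spectral clustering via a random-projection range
        # finder. A is a symmetric (adjacency) matrix; k is the number of blocks.
        import numpy as np
        from sklearn.cluster import KMeans

        def randomized_spectral_clustering(A, k, oversample=10, seed=None):
            rng = np.random.default_rng(seed)
            n = A.shape[0]
            Omega = rng.standard_normal((n, k + oversample))  # random test matrix
            Q, _ = np.linalg.qr(A @ Omega)                    # approximate range of A
            B = Q.T @ A @ Q                                   # small projected matrix
            vals, vecs = np.linalg.eigh(B)
            U = Q @ vecs[:, -k:]                              # approx top-k eigenvectors
            return KMeans(n_clusters=k, n_init=10).fit_predict(U)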
    An Homogeneous Unbalanced Regularized Optimal Transport model with applications to Optimal Transport with Boundary. (arXiv:2201.02082v1 [math.OC])
    (2 min) This work studies how the introduction of the entropic regularization term in unbalanced Optimal Transport (OT) models may alter their homogeneity with respect to the input measures. We observe that in common settings (including balanced OT and unbalanced OT with Kullback-Leibler divergence to the marginals), although the optimal transport cost itself is not homogeneous, optimal transport plans and the so-called Sinkhorn divergences are indeed homogeneous. However, homogeneity does not hold in more general Unbalanced Regularized Optimal Transport (UROT) models, for instance those using the Total Variation as divergence to the marginals. We propose to modify the entropic regularization term to retrieve an UROT model that is homogeneous while preserving most properties of the standard UROT model. We showcase the importance of using our Homogeneous UROT (HUROT) model when it comes to regularizing Optimal Transport with Boundary, a transportation model involving a spatially varying divergence to the marginals for which the standard (inhomogeneous) UROT model would yield inappropriate behavior.
    Efficient Vertex-Oriented Polytopic Projection for Web-scale Applications. (arXiv:2103.05277v3 [cs.AI] UPDATED)
    (2 min) We consider applications involving a large set of instances of projecting points to polytopes. We develop an intuition, guided by theoretical and empirical analysis, to show that when these instances follow certain structures, a large majority of the projections lie on vertices of the polytopes. To do these projections efficiently, we derive a vertex-oriented incremental algorithm to project a point onto any arbitrary polytope, as well as give specific algorithms to cater to simplex projection and polytopes where the unit box is cut by planes. Such settings are especially useful in web-scale applications such as optimal matching or allocation problems. Several such problems in internet marketplaces (e-commerce, ride-sharing, food delivery, professional services, advertising, etc.) can be formulated as Linear Programs (LP) with such polytope constraints that require a projection step in the overall optimization process. We show that in recent work of this kind, the polytopic projection is the most expensive step, and our efficient projection algorithms yield massive performance improvements.
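    For comparison, the classical sort-based Euclidean projection onto the probability simplex, a standard baseline for the simplex special case discussed above (this is not the paper's vertex-oriented incremental algorithm):

        # Classical sort-based projection onto the probability simplex
        # (Duchi et al., 2008): solve min_x ||x - v||^2 s.t. x >= 0, sum(x) = 1.
        import numpy as np

        def project_to_simplex(v):
            u = np.sort(v)[::-1]                      # sort descending
            css = np.cumsum(u)
            j = np.arange(1, len(v) + 1)
            rho = np.nonzero(u - (css - 1.0) / j > 0)[0][-1]
            theta = (css[rho] - 1.0) / (rho + 1.0)    # shrinkage threshold
            return np.maximum(v - theta, 0.0)

        print(project_to_simplex(np.array([0.9, 0.8, -0.5])))  # [0.55, 0.45, 0.]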
    Double Double Descent: On Generalization Errors in Transfer Learning between Linear Regression Tasks. (arXiv:2006.07002v7 [cs.LG] UPDATED)
    (2 min) We study the transfer learning process between two linear regression problems. An important and timely special case is when the regressors are overparameterized and perfectly interpolate their training data. We examine a parameter transfer mechanism whereby a subset of the parameters of the target task solution are constrained to the values learned for a related source task. We analytically characterize the generalization error of the target task in terms of the salient factors in the transfer learning architecture, i.e., the number of examples available, the number of (free) parameters in each of the tasks, the number of parameters transferred from the source to target task, and the relation between the two tasks. Our non-asymptotic analysis shows that the generalization error of the target task follows a two-dimensional double descent trend (with respect to the number of free parameters in each of the tasks) that is controlled by the transfer learning factors. Our analysis points to specific cases where the transfer of parameters is beneficial as a substitute for extra overparameterization (i.e., additional free parameters in the target task). Specifically, we show that the usefulness of a transfer learning setting is fragile and depends on a delicate interplay among the set of transferred parameters, the relation between the tasks, and the true solution. We also demonstrate that overparameterized transfer learning is not necessarily more beneficial when the source task is closer or identical to the target task.
    Fixed points of nonnegative neural networks. (arXiv:2106.16239v4 [stat.ML] UPDATED)
    (2 min) We derive conditions for the existence of fixed points of nonnegative neural networks, an important research objective to understand the behavior of neural networks in modern applications involving autoencoders and loop unrolling techniques, among others. In particular, we show that neural networks with nonnegative inputs and nonnegative parameters can be recognized as monotonic and (weakly) scalable functions within the framework of nonlinear Perron-Frobenius theory. This fact enables us to derive conditions for the existence of a nonempty fixed point set of the nonnegative neural networks, and these conditions are weaker than those obtained recently using arguments in convex analysis, which are typically based on the assumption of nonexpansivity of the activation functions. Furthermore, we prove that the shape of the fixed point set of monotonic and weakly scalable neural networks is often an interval, which degenerates to a point for the case of scalable networks. The chief results of this paper are verified in numerical simulations, where we consider an autoencoder-type network that first compresses angular power spectra in massive MIMO systems, and then reconstructs the input spectra from the compressed signals.
    Disentangling homophily, community structure and triadic closure in networks. (arXiv:2101.02510v3 [cs.SI] UPDATED)
    (2 min) Network homophily, the tendency of similar nodes to be connected, and transitivity, the tendency of two nodes being connected if they share a common neighbor, are conflated properties in network analysis, since one mechanism can drive the other. Here we present a generative model and corresponding inference procedure that are capable of distinguishing between both mechanisms. Our approach is based on a variation of the stochastic block model (SBM) with the addition of triadic closure edges, and its inference can identify the most plausible mechanism responsible for the existence of every edge in the network, in addition to the underlying community structure itself. We show how the method can evade the detection of spurious communities caused solely by the formation of triangles in the network, and how it can improve the performance of edge prediction when compared to the pure version of the SBM without triadic closure.
    Cooperative learning for multi-view analysis. (arXiv:2112.12337v3 [stat.ME] UPDATED)
    (2 min) We propose a new method for supervised learning with multiple sets of features ("views"). Cooperative learning combines the usual squared error loss of predictions with an "agreement" penalty to encourage the predictions from different data views to agree. By varying the weight of the agreement penalty, we get a continuum of solutions that include the well-known early and late fusion approaches. Cooperative learning chooses the degree of agreement (or fusion) in an adaptive manner, using a validation set or cross-validation to estimate test set prediction error. One version of our fitting procedure is modular, where one can choose different fitting mechanisms (e.g. lasso, random forests, boosting, neural networks) appropriate for different data views. In the setting of cooperative regularized linear regression, the method combines the lasso penalty with the agreement penalty. The method can be especially powerful when the different data views share some underlying relationship in their signals that we aim to strengthen, while each view has its idiosyncratic noise that we aim to reduce. We illustrate the effectiveness of our proposed method on simulated and real data examples.
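    The core objective is easy to write down for two views; a hedged sketch with plain gradient descent follows (the paper's regularized version adds lasso penalties, which would need a proximal step on top of this):

        # Sketch of the two-view cooperative objective:
        #   minimize 0.5*||X tx + Z tz - y||^2 + (rho/2)*||X tx - Z tz||^2
        import numpy as np

        def cooperative_fit(X, Z, y, rho=0.5, lr=1e-3, steps=5000):
            tx, tz = np.zeros(X.shape[1]), np.zeros(Z.shape[1])
            for _ in range(steps):
                fx, fz = X @ tx, Z @ tz
                r = fx + fz - y            # prediction residual
                d = fx - fz                # disagreement between the views
                tx -= lr * (X.T @ (r + rho * d))
                tz -= lr * (Z.T @ (r - rho * d))
            return tx, tz

        # rho = 0 recovers early-fusion least squares on the concatenation [X, Z];
        # a large rho forces the two views' predictions toward agreement.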
    Reliability Estimation of an Advanced Nuclear Fuel using Coupled Active Learning, Multifidelity Modeling, and Subset Simulation. (arXiv:2201.02172v1 [stat.AP])
    (2 min) Tristructural isotropic (TRISO)-coated particle fuel is a robust nuclear fuel and determining its reliability is critical for the success of advanced nuclear technologies. However, TRISO failure probabilities are small and the associated computational models are expensive. We used coupled active learning, multifidelity modeling, and subset simulation to estimate the failure probabilities of TRISO fuels using several 1D and 2D models. With multifidelity modeling, we replaced expensive high-fidelity (HF) model evaluations with information fusion from two low-fidelity (LF) models. For the 1D TRISO models, we considered three multifidelity modeling strategies: only Kriging, Kriging LF prediction plus Kriging correction, and deep neural network (DNN) LF prediction plus Kriging correction. While the results across these multifidelity modeling strategies compared satisfactorily, strategies employing information fusion from two LF models consistently called the HF model least often. Next, for the 2D TRISO model, we considered two multifidelity modeling strategies: DNN LF prediction plus Kriging correction (data-driven) and 1D TRISO LF prediction plus Kriging correction (physics-based). The physics-based strategy, as expected, consistently required the fewest calls to the HF model. However, the data-driven strategy had a lower overall simulation time since the DNN predictions are instantaneous, and the 1D TRISO model requires a non-negligible simulation time.
    Jointly Learning Environments and Control Policies with Projected Stochastic Gradient Ascent. (arXiv:2006.01738v4 [cs.LG] UPDATED)
    (2 min) We consider the joint design and control of discrete-time stochastic dynamical systems over a finite time horizon. We formulate the problem as a multi-step optimization problem under uncertainty seeking to identify a system design and a control policy that jointly maximize the expected sum of rewards collected over the time horizon considered. The transition function, the reward function and the policy are all parametrized, assumed known and differentiable with respect to their parameters. We then introduce a deep reinforcement learning algorithm combining policy gradient methods with model-based optimization techniques to solve this problem. In essence, our algorithm iteratively approximates the gradient of the expected return via Monte-Carlo sampling and automatic differentiation and takes projected gradient ascent steps in the space of environment and policy parameters. This algorithm is referred to as Direct Environment and Policy Search (DEPS). We assess the performance of our algorithm in three environments concerned with the design and control of a mass-spring-damper system, a small-scale off-grid power system and a drone, respectively. In addition, our algorithm is benchmarked against a state-of-the-art deep reinforcement learning algorithm used to tackle joint design and control problems. We show that DEPS performs at least as well or better in all three environments, consistently yielding solutions with higher returns in fewer iterations. Finally, solutions produced by our algorithm are also compared with solutions produced by an algorithm that does not jointly optimize environment and policy parameters, highlighting the fact that higher returns can be achieved when joint optimization is performed.
    Gaussian Imagination in Bandit Learning. (arXiv:2201.01902v1 [cs.LG])
    (2 min) Assuming distributions are Gaussian often facilitates computations that are otherwise intractable. We consider an agent who is designed to attain a low information ratio with respect to a bandit environment with a Gaussian prior distribution and a Gaussian likelihood function, but study the agent's performance when applied instead to a Bernoulli bandit. We establish a bound on the increase in Bayesian regret when an agent interacts with the Bernoulli bandit, relative to an information-theoretic bound satisfied with the Gaussian bandit. If the Gaussian prior distribution and likelihood function are sufficiently diffuse, this increase grows with the square-root of the time horizon, and thus the per-timestep increase vanishes. Our results formalize the folklore that so-called Bayesian agents remain effective when instantiated with diffuse misspecified distributions.
    Towards a theory of out-of-distribution learning. (arXiv:2109.14501v4 [stat.ML] UPDATED)
    (2 min) What is learning? 20$^{th}$ century formalizations of learning theory -- which precipitated revolutions in artificial intelligence -- focus primarily on $\mathit{in-distribution}$ learning, that is, learning under the assumption that the training data are sampled from the same distribution as the evaluation distribution. This assumption renders these theories inadequate for characterizing 21$^{st}$ century real world data problems, which are typically characterized by evaluation distributions that differ from the training data distributions (referred to as out-of-distribution learning). We therefore make a small change to existing formal definitions of learnability by relaxing that assumption. We then introduce $\mathbf{learning\ efficiency}$ (LE) to quantify the amount a learner is able to leverage data for a given problem, regardless of whether it is an in- or out-of-distribution problem. We then define and prove the relationship between generalized notions of learnability, and show how this framework is sufficiently general to characterize transfer, multitask, meta, continual, and lifelong learning. We hope this unification helps bridge the gap between empirical practice and theoretical guidance in real world problems. Finally, because biological learning continues to outperform machine learning algorithms on certain OOD challenges, we discuss the limitations of this framework vis-à-vis its ability to formalize biological learning, suggesting multiple avenues for future research.
    Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets. (arXiv:2003.03685v2 [stat.ME] UPDATED)
    (2 min) The paper introduces a novel conditional independence (CI) based method for linear and nonlinear, lagged and contemporaneous causal discovery from observational time series in the causally sufficient case. Existing CI-based methods such as the PC algorithm and also common methods from other frameworks suffer from low recall and partially inflated false positives for strong autocorrelation, which is a ubiquitous challenge in time series. The novel method, PCMCI$^+$, extends PCMCI [Runge et al., 2019b] to include discovery of contemporaneous links. PCMCI$^+$ improves the reliability of CI tests by optimizing the choice of conditioning sets and even benefits from autocorrelation. The method is order-independent and consistent in the oracle case. A broad range of numerical experiments demonstrates that PCMCI$^+$ has higher adjacency detection power and especially more contemporaneous orientation recall compared to other methods while better controlling false positives. Optimized conditioning sets also lead to much shorter runtimes than the PC algorithm. PCMCI$^+$ can be of considerable use in many real world application scenarios where often time resolutions are too coarse to resolve time delays and strong autocorrelation is present.
    Online nonnegative CP-dictionary learning for Markovian data. (arXiv:2009.07612v3 [stat.ML] UPDATED)
    (2 min) Online Tensor Factorization (OTF) is a fundamental tool in learning low-dimensional interpretable features from streaming multi-modal data. While various algorithmic and theoretical aspects of OTF have been investigated recently, a general convergence guarantee to stationary points of the objective function without any incoherence or sparsity assumptions is still lacking even for the i.i.d. case. In this work, we introduce a novel algorithm that learns a CANDECOMP/PARAFAC (CP) basis from a given stream of tensor-valued data under general constraints, including nonnegativity constraints that induce interpretability of the learned CP basis. We prove that our algorithm converges almost surely to the set of stationary points of the objective function under the hypothesis that the sequence of data tensors is generated by an underlying Markov chain. Our setting covers the classical i.i.d. case as well as a wide range of application contexts including data streams generated by independent or MCMC sampling. Our result closes a gap between OTF and Online Matrix Factorization in global convergence analysis for CP-decompositions. Experimentally, we show that our algorithm converges much faster than standard algorithms for nonnegative tensor factorization tasks on both synthetic and real-world data. Also, we demonstrate the utility of our algorithm on a diverse set of examples from image, video, and time-series data, illustrating how one may learn qualitatively different CP-dictionaries from the same tensor data by exploiting the tensor structure in multiple ways.
    Federated Optimization of Smooth Loss Functions. (arXiv:2201.01954v1 [cs.LG])
    (2 min) In this work, we study empirical risk minimization (ERM) within a federated learning framework, where a central server minimizes an ERM objective function using training data that is stored across $m$ clients. In this setting, the Federated Averaging (FedAve) algorithm is the staple for determining $\epsilon$-approximate solutions to the ERM problem. Similar to standard optimization algorithms, the convergence analysis of FedAve only relies on smoothness of the loss function in the optimization parameter. However, loss functions are often very smooth in the training data too. To exploit this additional smoothness, we propose the Federated Low Rank Gradient Descent (FedLRGD) algorithm. Since smoothness in data induces an approximate low rank structure on the loss function, our method first performs a few rounds of communication between the server and clients to learn weights that the server can use to approximate clients' gradients. Then, our method solves the ERM problem at the server using inexact gradient descent. To show that FedLRGD can have superior performance to FedAve, we present a notion of federated oracle complexity as a counterpart to canonical oracle complexity. Under some assumptions on the loss function, e.g., strong convexity in parameter, $\eta$-H\"older smoothness in data, etc., we prove that the federated oracle complexity of FedLRGD scales like $\phi m(p/\epsilon)^{\Theta(d/\eta)}$ and that of FedAve scales like $\phi m(p/\epsilon)^{3/4}$ (neglecting sub-dominant factors), where $\phi\gg 1$ is a "communication-to-computation ratio," $p$ is the parameter dimension, and $d$ is the data dimension. Then, we show that when $d$ is small and the loss function is sufficiently smooth in the data, FedLRGD beats FedAve in federated oracle complexity. Finally, in the course of analyzing FedLRGD, we also establish a result on low rank approximation of latent variable models.
    Robust Linear Predictions: Analyses of Uniform Concentration, Fast Rates and Model Misspecification. (arXiv:2201.01973v1 [stat.ML])
    (2 min) The problem of linear predictions has been extensively studied for the past century under quite general frameworks. Recent advances in the robust statistics literature allow us to analyze robust versions of classical linear models through the prism of Median of Means (MoM). Combining these approaches in a piecemeal way might lead to ad-hoc procedures, and the restricted theoretical conclusions that underpin each individual contribution may no longer be valid. To meet these challenges coherently, in this study, we offer a unified robust framework that includes a broad variety of linear prediction problems on a Hilbert space, coupled with a generic class of loss functions. Notably, we do not require any assumptions on the distribution of the outlying data points ($\mathcal{O}$) nor the compactness of the support of the inlying ones ($\mathcal{I}$). Under mild conditions on the dual norm, we show that for misspecification level $\epsilon$, these estimators achieve an error rate of $O(\max\left\{|\mathcal{O}|^{1/2}n^{-1/2}, |\mathcal{I}|^{1/2}n^{-1} \right\}+\epsilon)$, matching the best-known rates in the literature. This rate is slightly slower than the classical rates of $O(n^{-1/2})$, indicating that we need to pay a price in terms of error rates to obtain robust estimates. Additionally, we show that this rate can be improved to achieve so-called "fast rates" under additional assumptions.
    The dynamics of representation learning in shallow, non-linear autoencoders. (arXiv:2201.02115v1 [stat.ML])
    (2 min) Autoencoders are the simplest neural network for unsupervised learning, and thus an ideal framework for studying feature learning. While a detailed understanding of the dynamics of linear autoencoders has recently been obtained, the study of non-linear autoencoders has been hindered by the technical difficulty of handling training data with non-trivial correlations - a fundamental prerequisite for feature extraction. Here, we study the dynamics of feature learning in non-linear, shallow autoencoders. We derive a set of asymptotically exact equations that describe the generalisation dynamics of autoencoders trained with stochastic gradient descent (SGD) in the limit of high-dimensional inputs. These equations reveal that autoencoders learn the leading principal components of their inputs sequentially. An analysis of the long-time dynamics explains the failure of sigmoidal autoencoders to learn with tied weights, and highlights the importance of training the bias in ReLU autoencoders. Building on previous results for linear networks, we analyse a modification of the vanilla SGD algorithm which allows learning of the exact principal components. Finally, we show that our equations accurately describe the generalisation dynamics of non-linear autoencoders on realistic datasets such as CIFAR10.
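    The headline claim is easy to probe numerically; here is an illustrative check (not the paper's analysis) that a shallow linear autoencoder with tied weights, trained by plain SGD, ends up roughly spanning the leading principal subspace of its inputs:

        # Illustrative check: tied-weight linear autoencoder trained with SGD
        # on data whose first two directions carry the most variance.
        import numpy as np

        rng = np.random.default_rng(0)
        n, d, k = 10000, 50, 2
        X = rng.standard_normal((n, d)) @ np.diag(np.r_[5.0, 3.0, np.ones(d - 2)])

        W = 0.01 * rng.standard_normal((d, k))   # tied encoder/decoder weights
        lr = 1e-3
        for x in X:                              # one SGD pass on ||W W^T x - x||^2 / 2
            h = W.T @ x                          # code
            r = W @ h - x                        # reconstruction error
            W -= lr * (np.outer(r, h) + np.outer(x, W.T @ r))

        U = np.linalg.svd(X, full_matrices=False)[2][:k].T  # top-2 principal directions
        Q = np.linalg.qr(W)[0]
        print(np.linalg.svd(U.T @ Q, compute_uv=False))     # principal-angle cosines near 1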
    Efficiently Disentangle Causal Representations. (arXiv:2201.01942v1 [cs.LG])
    (2 min) This paper proposes an efficient approach to learning disentangled representations with causal mechanisms, based on the difference of conditional probabilities in original and new distributions. We approximate the difference with models' generalization abilities so that it fits in the standard machine learning framework and can be efficiently computed. In contrast to the state-of-the-art approach, which relies on the learner's adaptation speed to a new distribution, the proposed approach only requires evaluating the model's generalization ability. We provide a theoretical explanation for the advantage of the proposed method, and our experiments show that the proposed technique is 1.9--11.0$\times$ more sample efficient and 9.4--32.4$\times$ quicker than the previous method on various tasks. The source code is available at https://github.com/yuanpeng16/EDCR.
    Yield Spread Selection in Predicting Recession Probabilities: A Machine Learning Approach. (arXiv:2101.09394v2 [econ.EM] UPDATED)
    (2 min) The literature on using yield curves to forecast recessions customarily uses 10-year--three-month Treasury yield spread without verification on the pair selection. This study investigates whether the predictive ability of spread can be improved by letting a machine learning algorithm identify the best maturity pair and coefficients. Our comprehensive analysis shows that, despite the likelihood gain, the machine learning approach does not significantly improve prediction, owing to the estimation error. This is robust to the forecasting horizon, control variable, sample period, and oversampling of the recession observations. Our finding supports the use of the 10-year--three-month spread.
    Representation Learning for Online and Offline RL in Low-rank MDPs. (arXiv:2110.04652v3 [cs.LG] UPDATED)
    (2 min) This work studies the question of Representation Learning in RL: how can we learn a compact low-dimensional representation such that on top of the representation we can perform RL procedures such as exploration and exploitation, in a sample efficient manner. We focus on the low-rank Markov Decision Processes (MDPs) where the transition dynamics correspond to a low-rank transition matrix. Unlike prior works that assume the representation is known (e.g., linear MDPs), here we need to learn the representation for the low-rank MDP. We study both the online RL and offline RL settings. For the online setting, operating with the same computational oracles used in FLAMBE (Agarwal et al.), the state-of-the-art algorithm for learning representations in low-rank MDPs, we propose an algorithm REP-UCB (Upper Confidence Bound driven Representation learning for RL), which significantly improves the sample complexity from $\widetilde{O}( A^9 d^7 / (\epsilon^{10} (1-\gamma)^{22}))$ for FLAMBE to $\widetilde{O}( A^2 d^4 / (\epsilon^2 (1-\gamma)^{5}) )$ with $d$ being the rank of the transition matrix (or dimension of the ground truth representation), $A$ being the number of actions, and $\gamma$ being the discount factor. Notably, REP-UCB is simpler than FLAMBE, as it directly balances the interplay between representation learning, exploration, and exploitation, while FLAMBE is an explore-then-commit style approach and has to perform reward-free exploration step-by-step forward in time. For the offline RL setting, we develop an algorithm that leverages pessimism to learn under a partial coverage condition: our algorithm is able to compete against any policy as long as it is covered by the offline distribution.
    CausalSim: Toward a Causal Data-Driven Simulator for Network Protocols. (arXiv:2201.01811v1 [cs.LG])
    (2 min) Evaluating the real-world performance of network protocols is challenging. Randomized control trials (RCT) are expensive and inaccessible to most researchers, while expert-designed simulators fail to capture complex behaviors in real networks. We present CausalSim, a data-driven simulator for network protocols that addresses this challenge. Learning network behavior from observational data is complicated due to the bias introduced by the protocols used during data collection. CausalSim uses traces from an initial RCT under a set of protocols to learn a causal network model, effectively removing the biases present in the data. Using this model, CausalSim can then simulate any protocol over the same traces (i.e., for counterfactual predictions). Key to CausalSim is the novel use of adversarial neural network training that exploits distributional invariances that are present due to the training data coming from an RCT. Our extensive evaluation of CausalSim on both real and synthetic datasets and two use cases, including more than nine months of real data from the Puffer video streaming system, shows that it provides accurate counterfactual predictions, reducing prediction error by 44% and 53% on average compared to expert-designed and standard supervised learning baselines.
    A note on efficient minimum cost adjustment sets in causal graphical models. (arXiv:2201.02037v1 [math.ST])
    (2 min) We study the selection of adjustment sets for estimating the interventional mean under an individualized treatment rule. We assume a non-parametric causal graphical model with, possibly, hidden variables and at least one adjustment set comprised of observable variables. Moreover, we assume that observable variables have positive costs associated with them. We define the cost of an observable adjustment set as the sum of the costs of the variables that comprise it. We show that in this setting there exist adjustment sets that are minimum cost optimal, in the sense that they yield non-parametric estimators of the interventional mean with the smallest asymptotic variance among those that control for observable adjustment sets that have minimum cost. Our results are based on the construction of a special flow network associated with the original causal graph. We show that a minimum cost optimal adjustment set can be found by computing a maximum flow on the network, and then finding the set of vertices that are reachable from the source by augmenting paths. The optimaladj Python package implements the algorithms introduced in this paper.
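    The final step of the construction can be illustrated on a toy flow network with networkx: after a max flow is computed, the source side of a minimum cut is exactly the set of vertices reachable by augmenting paths. The graph below is made up for illustration; see the optimaladj package for the actual causal-graph-to-network transformation:

        # Toy min-cut illustration; capacities play the role of variable costs.
        import networkx as nx

        G = nx.DiGraph()
        G.add_edge("s", "a", capacity=3.0)
        G.add_edge("s", "b", capacity=1.0)
        G.add_edge("a", "t", capacity=2.0)
        G.add_edge("b", "t", capacity=4.0)

        cut_value, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
        print(cut_value, source_side)  # 3.0 {'s', 'a'}: vertices reachable from the source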
    Group structure estimation for panel data -- a general approach. (arXiv:2201.01793v1 [stat.ME])
    (2 min) Consider a panel data setting where repeated observations on individuals are available. Often it is reasonable to assume that there exist groups of individuals that share similar effects of observed characteristics, but the grouping is typically unknown in advance. We propose a novel approach to estimate such unobserved groupings for general panel data models. Our method explicitly accounts for the uncertainty in individual parameter estimates and remains computationally feasible with a large number of individuals and/or repeated measurements on each individual. The developed ideas can be applied even when individual-level data are not available and only parameter estimates together with some quantification of uncertainty are given to the researcher.
    AIR-Net: Adaptive and Implicit Regularization Neural Network for Matrix Completion. (arXiv:2110.07557v3 [cs.LG] UPDATED)
    (2 min) The explicit low-rank regularization, e.g., nuclear norm regularization, has been widely used in imaging sciences. However, it has been found that implicit regularization outperforms explicit ones in various image processing tasks. Another issue is that fixed explicit regularization limits applicability to broad kinds of images, since different images favor different features captured by different explicit regularizations. As such, this paper proposes a new adaptive and implicit low-rank regularization that captures the low-rank prior dynamically from the training data. At the core of our new adaptive and implicit low-rank regularization is parameterizing the Laplacian matrix in the Dirichlet energy-based regularization with a neural network, and we call the proposed model AIR-Net. Theoretically, we show that the adaptive regularization of AIR-Net enhances the implicit regularization and vanishes at the end of training. We validate AIR-Net's effectiveness on various benchmark tasks, indicating that AIR-Net is particularly favorable for scenarios where the missing entries are non-uniform. The code can be found at https://github.com/lizhemin15/AIR-Net.
    Learning in Markov Decision Processes under Constraints. (arXiv:2002.12435v5 [cs.LG] UPDATED)
    (2 min) We consider reinforcement learning (RL) in Markov Decision Processes in which an agent repeatedly interacts with an environment that is modeled by a controlled Markov process. At each time step $t$, it earns a reward, and also incurs a cost-vector consisting of $M$ costs. We design model-based RL algorithms that maximize the cumulative reward earned over a time horizon of $T$ time-steps, while simultaneously ensuring that the average values of the $M$ cost expenditures are bounded by agent-specified thresholds $c^{ub}_i,i=1,2,\ldots,M$. In order to measure the performance of a reinforcement learning algorithm that satisfies the average cost constraints, we define an $M+1$ dimensional regret vector that is composed of its reward regret, and $M$ cost regrets. The reward regret measures the sub-optimality in the cumulative reward, while the $i$-th component of the cost regret vector is the difference between its $i$-th cumulative cost expense and the expected cost expenditures $Tc^{ub}_i$. We prove that the expected value of the regret vector of UCRL-CMDP is upper-bounded as $\tilde{O}\left(T^{2\slash 3}\right)$, where $T$ is the time horizon. We further show how to reduce the regret of a desired subset of the $M$ costs, at the expense of increasing the regrets of rewards and the remaining costs. To the best of our knowledge, ours is the only work that considers non-episodic RL under average cost constraints, and derives algorithms that can tune the regret vector according to the agent's requirements on its cost regrets.
  • John D. Cook

    Three paths converge
    (2 min) When does the equation x^2 + 7 = 2^n have integer solutions? This is an interesting question, but why would anyone ask it? This post looks at three paths that have led to this problem. Ramanujan [1] considered this problem in 1913. He found five solutions and conjectured that there were no more. Then […]
  • Artificial Intelligence

    Coqui Introduces ‘YourTTS’: A Zero-Sample Text-to-Speech Model With State-of-The-Art (SOTA) Results
    (1 min) Recent advancements in end-to-end deep learning models have enabled new and intriguing Text-to-Speech (TTS) use-cases with excellent natural-sounding outcomes. However, the majority of these models are trained on large datasets recorded with a single speaker in a professional setting. Expanding solutions to numerous languages and speakers is not viable for everyone in this situation. It is more challenging for low-resource languages not often studied by mainstream research. Coqui's team has designed 'YourTTS' to overcome these limits and provide zero-shot TTS to low-resource languages. It can synthesize voices in various languages and drastically reduce data requirements by transferring information across the training set. Continue Reading.... Paper: https://arxiv.org/pdf/2112.02418.pdf Github: https://github.com/coqui-ai/TTS submitted by /u/ai-lover [link] [comments]
    ThisLocationDoesNotExist
    (1 min) Satellite images can be easily re-purposed by using GANs that can synthesize fake aerial images. Check this out: https://mayachitra-thislocationdoesnotexist.azurewebsites.net/ PS: GANs are getting stronger day by day. This post is just one of the million examples out there! submitted by /u/ArmadilloFabulous774 [link] [comments]
    Eric Jang on Robots Learning at Google and Generalization via Language
    (0 min) submitted by /u/regalalgorithm [link] [comments]
    What is the best AI for making face edits?
    (0 min) submitted by /u/xXNOdrugsForMEXx [link] [comments]
    Biomedical relationship extraction for knowledge graph creation using machine learning
    (0 min) submitted by /u/dreadknight011 [link] [comments]
    Coursera Plus Annual Subscription is available at $100 discount until 1/13 if you fancy
    (0 min) submitted by /u/biggbrother23 [link] [comments]
    Yale University and IBM Researchers Introduce Kernel Graph Neural Networks (KerGNNs)
    (1 min) Graph kernel approaches have typically been the most popular strategy for graph classification tasks. Graph kernels can be thought of as functions that measure the similarity of two graphs. They allow kernelized learning algorithms like support vector machines to work directly on graphs rather than convert them to fixed-length, real-valued feature vectors through feature extraction. In recent years, the use of Graph Neural Networks (GNNs) based on high-performance message-passing neural networks (MPNNs) has exploded. As a result, they've grown increasingly popular for graph classification. However, their performance is limited by their hand-crafted combinatorial features. Yale University and IBM researchers propose Kernel Graph Neural Networks (KerGNNs). KerGNNs are frameworks that combine graph kernels and the GNN message-passing procedure into one. They achieve results that are equivalent to cutting-edge approaches. Simultaneously, they vastly increase model interpretability when compared to traditional GNNs. Continue Reading.... Paper: https://arxiv.org/pdf/2201.00491.pdf submitted by /u/ai-lover [link] [comments]
    Amazon Original Upload: Are we Inching Towards this Dystopian Reality with Metaverse?
    (0 min) submitted by /u/yadavvenugopal [link] [comments]
  • Machine Learning

    [D] Eric Jang on Robots Learning at Google and Generalization via Language
    (1 min) Hi there, just want to share our latest Gradient interview with Eric Jang, a research scientist on Google AI's Robotics team. He has worked a good deal with the arm farm, and more recently on imitation learning and reinforcement learning for enabling robots to generalize to many tasks. Here's the link: Eric Jang on Robots Learning at Google and Generalization via Language As usual, we get pretty technical and most of it goes over his research, as well as his recent blog posts. Sections: (00:00) Intro (00:50) Start in AI / Research (03:58) Joining Google Robotics (10:08) End to End Learning of Semantic Grasping (19:11) Off Policy RL for Robotic Grasping (29:33) Grasp2Vec (40:50) Watch, Try, Learn Meta-Learning from Demonstrations and Rewards (50:12) BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning (59:41) Just Ask for Generalization (01:09:02) Data for Robotics (01:22:10) To Understand Language is to Understand Generalization (01:32:38) Outro Papers discussed: Grasp2Vec: Learning Object Representations from Self-Supervised Grasping End-to-End Learning of Semantic Grasping Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods Watch, Try, Learn Meta-Learning from Demonstrations and Rewards BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning Just Ask for Generalization To Understand Language is to Understand Generalization Robots Must Be Ephemeralized submitted by /u/regalalgorithm [link] [comments]
    Hypothetical: testing a spiking neural network in an android [D] [R] [P]
    (1 min) I am working on a hard sci fi project set several hundred years in the future. I'd like to show a data scientist running realistic (but interesting) tests on an android to see if its core SNN is intact. Ideally, this would be some kind of behavioral simulation, a series of tests that the android has to "pass." Thought I should come to the experts for suggestions! submitted by /u/Sensitive_Necessary7 [link] [comments]
    [D] How does 10-shot work?
    (0 min) I am trying to get into contrastive learning, and I have stumbled upon benchmarks like this: https://paperswithcode.com/sota/few-shot-image-classification-on-imagenet-10 What exactly is meant by 10-shot? Is it just training on 10 random examples (selected beforehand for fairness)? How does it work? submitted by /u/AICoderGamer [link] [comments]
    [P] Weekly updated open sourced model implementations in Flax
    (1 min) Even though the popularity of JAX is increasing everyday, open sourced code implementations are still hard to find. To counter this, I have created a repository with the goal to achieve something similar to timm for torch (except for the pretrained models because I don't have the funds currently). Some features of jax_models: 1. Pip installable 2. Weekly / fortnightly model additions 3. Comprehensive tests 4. Model architectures are cross checked by checking original implementations if available or otherwise miscellaneous aspects of the paper such as model parameters The repository is currently in its pre-alpha stage with only around 6 model and layer implementations. If you are interested in this, you can contribute by either asking for specific paper implementations or by helping me train and open source weights of these models. If you like my work then please consider starring the repository and giving feedback. Thank you! Link: https://github.com/DarshanDeshpande/jax-models submitted by /u/Megixist [link] [comments]
    [R] A 5 million source code file dataset
    (1 min) I published a dataset of 5m source code files from 15k open source projects. Its long term goal is to enable identifying causality in software engineering. Describing paper: End to End Software Engineering Research. People from ML, NLP, causality, and software engineering might find it interesting. The dataset enables investigating code similarity (and also text similarity in English), program difficulty, defect prediction, etc. The code is re-extracted every two months in order to investigate the differences. From these differences one can investigate whether a change in a possible cause (e.g., smell removal) leads to an effect (fewer bugs) and in which context. I plan to keep extending the dataset and would like to get feedback on it - ease of use, new use cases, etc. If you have a related dataset that can be merged with it, related data that you would like to obtain, or ideas for research directions, please contact me. submitted by /u/idan_huji [link] [comments]
    [R] The PyMC devs have made their books available free online!
    (1 min) Foreword: "Bayesian modeling provides an elegant approach to many data science and decision-making problems. However, it can be hard to make it work well in practice. In particular, although there are many software packages that make it easy to specify complex hierarchical models such as Stan, PyMC3, TensorFlow Probability (TFP), and Pyro, users still need additional tools to diagnose whether the results of their computations are correct or not. They may also need advice on what to do when things do go wrong. This book focuses on the ArviZ library, which enables users to perform exploratory analysis of Bayesian models, for example, diagnostics of posterior samples generated by any inference method. This can be used to diagnose a variety of failure modes in Bayesian inference. The book also discusses various modeling strategies (such as centering) that can be employed to eliminate many of the most common problems. Most of the examples in the book use PyMC3, although some also use TFP; a brief comparison of other probabilistic programming languages is also included. The authors are all experts in the area of Bayesian software and are major contributors to the PyMC3, ArviZ, and TFP libraries. They also have significant experience applying Bayesian data analysis in practice, and this is reflected in the practical approach adopted in this book. Overall, I think this is a valuable addition to the literature, which should hopefully further the adoption of Bayesian methods." Link: https://bayesiancomputationbook.com/welcome.html submitted by /u/bikeskata [link] [comments]
    ‎StyleGAN 2 and PSP running on Apple Neural Engine [P]
    (1 min) submitted by /u/surelyouarejoking [link] [comments]
    [D] Fourier transform vs NNs as function approximators
    (6 min) So this is probably a basic question. If the main premise of neural networks is that they are global function approximators, what advantage do they have over other approximators such as the Fourier transform, which is also proven to be able to approximate any function? Why doesn't the whole supervised learning field become one of calculating Fourier coefficients? submitted by /u/Hazalem [link] [comments]
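    A toy version of the premise may sharpen the question: fitting truncated Fourier coefficients is just linear least squares on fixed sinusoidal features, whereas a neural network adapts its features to the target; for fixed tensor-product Fourier bases the number of features grows exponentially with the input dimension:

        # Toy demo: truncated Fourier approximation as linear least squares
        # on fixed sinusoidal features of a 1-D input.
        import numpy as np

        x = np.linspace(0, 1, 200)
        y = np.exp(-3 * x) * np.sin(12 * x)          # arbitrary target function

        K = 10                                       # truncation order
        features = [np.ones_like(x)]
        for k in range(1, K + 1):
            features += [np.cos(2 * np.pi * k * x), np.sin(2 * np.pi * k * x)]
        Phi = np.stack(features, axis=1)

        coeffs, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        print(np.abs(Phi @ coeffs - y).max())        # small approximation error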
    [D] How often is the case that a conference submission is rejected everywhere?
    (3 min) My understanding is that if your research paper is rejected from a conference, you would take the reviewers' feedback, make some changes, then submit to the next big ML conference. The process repeats until the paper is eventually accepted somewhere. But how often is it the case that the paper isn't accepted anywhere? How many papers are accepted on the first (or second) try? I'm an undergrad who's really new to the concept of publishing papers, so I thought it's worth asking around here. :) submitted by /u/akardashian [link] [comments]
    [D] Is there a complete list of modern AI algorithms and their main applications?
    (1 min) Like DNN for all domains; Reinforcement learning for? Genetic algorithms for? Symbolic AI for? ... submitted by /u/ghosthamlet [link] [comments]
    [D] Questions about paper "Pay Attention to MLPs"
    (1 min) In Pay Attention to MLPs, the authors said that it is necessary for the layer s(·) to contain a contraction operation over the spatial dimension. It's hard to understand why s(·) should be a contraction operation. I think s(·) should maintain the spatial dimension for proper calculation. Why should s(·) be a contraction operation? Can I say the transformer dynamically parameterizes spatial interactions with a positional encoder? I am confused whether spatial interaction already contains the meaning of positional encoding. What do you think? Thank you in advance. submitted by /u/Spiritual_Fig3632 [link] [comments]
    Can we use data available for "non-commercial use" for training a DL model which will be deployed as a subsystem in a commercial product? [D]
    (3 min) I am having a lot of difficulty getting the right data for my task, since most of the datasets out there, including ImageNet, restrict the usage of the data to only educational and non-commercial research. In this case, should I reach out to them to get usage permission? What are the chances that they will allow/respond, and would there be a significant cost associated with this? Thanks! submitted by /u/PinPitiful [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    A Neural Network Solves and Generates Mathematics Problems by Program Synthesis: Calculus, Differential Equations, Linear Algebra, and More
    (1 min) submitted by /u/nickb [link] [comments]
    Optimization of the neural network with the use of an algorithm simulated annealing
    (1 min) Hi, Have any of you tried to optimize a neural network using the simulated annealing algorithm? I am looking for an example, even on a simple neural network, because I have a similar project and am working out how to get started with the topic, so a practical example would be extremely helpful... Additionally, scientific papers or hypothetical examples of use would also be helpful. Thanks in advance for your help. submitted by /u/theGrEaTmPm [link] [comments]
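    A minimal, self-contained example of the idea (illustrative and untuned): optimizing the weights of a tiny one-hidden-layer network on XOR with simulated annealing instead of gradient descent:

        # Simulated annealing over the flattened weight vector of a 2-4-1 network.
        import numpy as np

        rng = np.random.default_rng(0)
        X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
        y = np.array([0.0, 1.0, 1.0, 0.0])           # XOR targets

        def loss(w):
            W1, b1 = w[:8].reshape(2, 4), w[8:12]    # 2 -> 4 hidden layer
            W2, b2 = w[12:16], w[16]                 # 4 -> 1 linear output
            h = np.tanh(X @ W1 + b1)
            return np.mean((h @ W2 + b2 - y) ** 2)

        w = rng.standard_normal(17)
        cur, T = loss(w), 1.0
        for step in range(20000):
            cand = w + rng.normal(scale=0.1, size=w.size)   # random neighbour
            delta = loss(cand) - cur
            if delta < 0 or rng.random() < np.exp(-delta / T):
                w, cur = cand, cur + delta                   # accept, possibly uphill
            T = max(1e-3, T * 0.9995)                        # geometric cooling
        print(cur)   # mean squared error; should drop well below its initial value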
    Newbie needs help generating data charts for pattern recognition
    (1 min) Hello guys, I am a newbie to neural nets. I need your help. I have 6 data patterns in a quality control chart, and I need to generate more examples of each of those 6 graphs in order to train a neural net to identify them in random charts. I have tried using Gaussian noise but something doesn't work well. I prefer using Python to generate them. submitted by /u/aherontas [link] [comments]
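    One common recipe for the six classic control-chart patterns is to add fresh Gaussian noise to a deterministic template per pattern; the parameters below are illustrative and would need tuning to match the real charts:

        # Synthesize the six classic patterns: normal, cyclic, up/down trend,
        # up/down shift. Each call draws a new noisy instance.
        import numpy as np

        def make_chart(pattern, n=60, rng=None):
            if rng is None:
                rng = np.random.default_rng()
            t = np.arange(n)
            base = 30 + rng.normal(0, 2, n)            # in-control noise ("normal")
            if pattern == "cyclic":
                base += 10 * np.sin(2 * np.pi * t / 15)
            elif pattern == "up_trend":
                base += 0.3 * t
            elif pattern == "down_trend":
                base -= 0.3 * t
            elif pattern == "up_shift":
                base += 10 * (t >= n // 2)             # step change mid-series
            elif pattern == "down_shift":
                base -= 10 * (t >= n // 2)
            return base

        # e.g. 100 training instances per pattern:
        patterns = ["normal", "cyclic", "up_trend", "down_trend", "up_shift", "down_shift"]
        train = [(make_chart(p), p) for p in patterns for _ in range(100)]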
    Yale University and IBM Researchers Introduce Kernel Graph Neural Networks (KerGNNs)
    (1 min) Graph kernel approaches have typically been the most popular strategy for graph classification tasks. Graph kernels can be thought of as functions that measure the similarity of two graphs. They allow kernelized learning algorithms like support vector machines to work directly on graphs rather than convert them to fixed-length, real-valued feature vectors through feature extraction. In recent years, the use of Graph Neural Networks (GNNs) based on high-performance message-passing neural networks (MPNNs) has exploded. As a result, they've grown increasingly popular for graph classification. However, their performance is limited by their hand-crafted combinatorial features. Yale University and IBM researchers propose Kernel Graph Neural Networks (KerGNNs). KerGNNs are frameworks that combine graph kernels and the GNN message-passing procedure into one. They achieve results that are equivalent to cutting-edge approaches. Simultaneously, they vastly increase model interpretability when compared to traditional GNNs. Continue Reading.... Paper: https://arxiv.org/pdf/2201.00491.pdf submitted by /u/ai-lover [link] [comments]
    Image GANs meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering (aka 3D objects from a single image)
    (1 min) submitted by /u/nickb [link] [comments]
  • Reinforcement Learning

    Hiring: My team is hiring 2 interns at Intel. Read post description
    (2 min) Looking for 2 interns for summer. There are a few pre-reqs. Must be in the US. Must be enrolled in a Ph.D. program (sorry, no MS and BS students). Must have a track record of doing RL research, i.e. 2 to 3 publications at least. If you are interested, please send me a link to your resume. Thanks Edit: hiring 2 interns submitted by /u/schrodingershit [link] [comments]
    Offline RL using Policy Gradients
    (1 min) I have a dataset of events that take place in football (soccer) games. This is being framed as a reinforcement learning problem by defining episodes to be the events (pass, shoot, dribble, etc.) that take place between a team gaining possession and losing it (by shooting, losing possession, etc.). I have created a simple NN that estimates what the next action will be from the possible action space as a probability distribution. This NN was trained on the historical data from past matches. I would like to know if there are any good pointers for implementing Policy Gradients to nudge the probability distribution learned by the NN in the direction of some reward function that I define. Ideally the advice would be for tensorflow as that is what I am familiar with. submitted by /u/uom_questions [link] [comments]
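    A rough TensorFlow sketch of the usual trick (the names are hypothetical, and `model` is assumed to output action logits): weight the log-probability of the action actually taken by a centered return, so that minimizing the loss is gradient ascent on rewarded actions. For genuinely off-policy data you would also want importance weighting or some conservatism on top of this:

        # REINFORCE-style update on logged (state, action, return) triples.
        import tensorflow as tf

        optimizer = tf.keras.optimizers.Adam(1e-4)

        def policy_gradient_step(model, states, actions, returns, baseline=0.0):
            with tf.GradientTape() as tape:
                logits = model(states)                              # (batch, n_actions)
                neg_logp = tf.nn.sparse_softmax_cross_entropy_with_logits(
                    labels=actions, logits=logits)                  # = -log pi(a|s)
                loss = tf.reduce_mean(neg_logp * (returns - baseline))
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))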
    Can a supervised ANN detect a cat and a dog or just a cat?
    (2 min) So a NN is good at scanning a thousand labelled pictures of a cat, and then when it sees a cat it has a high likelihood of knowing it's a cat (if it is trained right). Can that same NN (with the same or a similar configuration) also tell a cat from a dog among a range of random animals, assuming the weights were adjusted and the training redone? Or would I need two NNs for that? Does the first NN think it's a cat? No? Okay - does the 2nd NN think it's a dog? Yes! submitted by /u/Togfox [link] [comments]
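    The usual answer is a single network with a softmax output over all the classes, retrained with one integer label per class; a minimal Keras sketch (layer sizes and image shape are placeholders):

        import tensorflow as tf

        num_classes = 3  # e.g. cat, dog, other
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(16, 3, activation="relu",
                                   input_shape=(64, 64, 3)),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(num_classes, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        # model.fit(images, integer_labels, ...) -- one model, no second NN needed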
    RL (finance focused) Study Group
    (1 min) Hey everyone - I am looking for people (preferably MSc or PhD students/grads) who are interested in learning applied RL in finance. I'm looking to create a group of 5-8 people to meet weekly or biweekly for around 1-2hrs to discuss papers and potential projects that we can collaborate on. Please let me know if interested. submitted by /u/ivan_426 [link] [comments]

2022-01-07

  • cs.LG updates on arXiv.org

    Using Deep Learning with Large Aggregated Datasets for COVID-19 Classification from Cough. (arXiv:2201.01669v1 [eess.AS])
    (2 min) The Covid-19 pandemic has been a scourge upon humanity, claiming the lives of more than 5 million people worldwide. Although vaccines are being distributed worldwide, there is an apparent need for affordable screening techniques to serve parts of the world that do not have access to traditional medicine. Artificial Intelligence can provide a solution utilizing cough sounds as the primary screening mode. This paper presents multiple models that have achieved relatively respectable performance on the largest evaluation dataset currently presented in academic literature. Moreover, we also show that performance increases with training data size, showing the need for the worldwide collection of data to help combat the Covid-19 pandemic with non-traditional means.
    Predicting trajectory behaviour via machine-learned invariant manifolds. (arXiv:2107.10154v2 [physics.chem-ph] UPDATED)
    (2 min) In this paper, we use support vector machines (SVM) to develop a machine learning framework to discover phase space structures that distinguish between distinct reaction pathways. The SVM model is trained using data from trajectories of Hamilton's equations and works well even with relatively few trajectories. Moreover, this framework is specifically designed to require minimal a priori knowledge of the dynamics in a system. This makes our approach computationally better suited than existing methods for high-dimensional systems and systems where integrating trajectories is expensive. We benchmark our approach on Chesnavich's CH$_4^+$ Hamiltonian.
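    A hedged sklearn sketch of the general recipe, not the paper's pipeline: label each sampled trajectory's initial phase-space point by the pathway it followed, then train an SVM whose decision surface approximates the dividing structure.

        import numpy as np
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 4))                   # placeholder (q, p) coordinates
        y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)   # placeholder pathway labels

        clf = SVC(kernel="rbf", C=1.0).fit(X, y)
        # clf.decision_function(points) gives a signed distance to the learned
        # phase-space boundary separating the two reaction pathways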
    Suppression of Correlated Noise with Similarity-based Unsupervised Deep Learning. (arXiv:2011.03384v6 [cs.LG] UPDATED)
    (0 min) Image denoising is a prerequisite for downstream tasks in many fields. Low-dose and photon-counting computed tomography (CT) denoising can optimize diagnostic performance at a minimized radiation dose. Supervised deep denoising methods are popular but require paired clean and noisy samples that are often unavailable in practice. Limited by the independent noise assumption, current unsupervised denoising methods cannot process correlated noise as in CT images. Here we propose a first-of-its-kind similarity-based unsupervised deep denoising approach, referred to as Noise2Sim, that works in a nonlocal and nonlinear fashion to suppress not only independent but also correlated noise. Theoretically, Noise2Sim is asymptotically equivalent to supervised learning methods under mild conditions. Experimentally, Noise2Sim recovers intrinsic features from noisy low-dose CT and photon-counting CT images as effectively as, or even better than, supervised learning methods on practical datasets, visually, quantitatively and statistically. Noise2Sim is a general unsupervised denoising approach and has great potential in diverse applications.
    Latte: Cross-framework Python Package for Evaluation of Latent-Based Generative Models. (arXiv:2112.10638v2 [cs.LG] UPDATED)
    (0 min) Latte (for LATent Tensor Evaluation) is a Python library for the evaluation of latent-based generative models in the fields of disentanglement learning and controllable generation. Latte is compatible with both PyTorch and TensorFlow/Keras, and provides both functional and modular APIs that can be easily extended to support other deep learning frameworks. Using a NumPy-based and framework-agnostic implementation, Latte ensures reproducible, consistent, and deterministic metric calculations regardless of the deep learning framework of choice.
    End-to-End Autoencoder Communications with Optimized Interference Suppression. (arXiv:2201.01388v1 [cs.IT])
    (0 min) An end-to-end communications system based on Orthogonal Frequency Division Multiplexing (OFDM) is modeled as an autoencoder (AE) for which the transmitter (coding and modulation) and receiver (demodulation and decoding) are represented as deep neural networks (DNNs) of the encoder and decoder, respectively. This AE communications approach is shown to outperform conventional communications in terms of bit error rate (BER) under practical scenarios regarding channel and interference effects as well as training data and embedded implementation constraints. A generative adversarial network (GAN) is trained to augment the training data when there is not enough training data available. Also, the performance is evaluated in terms of the DNN model quantization and the corresponding memory requirements for embedded implementation. Then, interference training and randomized smoothing are introduced to train the AE communications to operate under unknown and dynamic interference (jamming) effects on potentially multiple OFDM symbols. Relative to conventional communications, up to 36 dB interference suppression for a channel reuse of four can be achieved by the AE communications with interference training and randomized smoothing. AE communications is also extended to the multiple-input multiple-output (MIMO) case and its BER performance gain with and without interference effects is demonstrated compared to conventional MIMO communications.
    Formant Tracking Using Quasi-Closed Phase Forward-Backward Linear Prediction Analysis and Deep Neural Networks. (arXiv:2201.01525v1 [eess.AS])
    (0 min) Formant tracking is investigated in this study by using trackers based on dynamic programming (DP) and deep neural nets (DNNs). Using the DP approach, six formant estimation methods were first compared. The six methods include linear prediction (LP) algorithms, weighted LP algorithms and the recently developed quasi-closed phase forward-backward (QCP-FB) method. QCP-FB gave the best performance in the comparison. Therefore, a novel formant tracking approach, which combines benefits of deep learning and signal processing based on QCP-FB, was proposed. In this approach, the formants predicted by a DNN-based tracker from a speech frame are refined using the peaks of the all-pole spectrum computed by QCP-FB from the same frame. Results show that the proposed DNN-based tracker performed better both in detection rate and estimation error for the lowest three formants compared to reference formant trackers. Compared to the popular Wavesurfer, for example, the proposed tracker gave a reduction of 29%, 48% and 35% in the estimation error for the lowest three formants, respectively.
    A Federated Deep Learning Framework for Privacy Preservation and Communication Efficiency. (arXiv:2001.09782v3 [cs.DC] UPDATED)
    (0 min) Deep learning has achieved great success in many applications. However, its deployment in practice has been hindered by two issues: the privacy of data that has to be aggregated centrally for model training, and the high communication overhead due to the transmission of large amounts of usually geographically distributed data. Addressing both issues is challenging and most existing works could not provide an efficient solution. In this paper, we develop FedPC, a Federated Deep Learning Framework for Privacy Preservation and Communication Efficiency. The framework allows a model to be learned on multiple private datasets while not revealing any information of the training data, even intermediate data. The framework also minimizes the amount of data exchanged to update the model. We formally prove the convergence of the learning model when training with FedPC and its privacy-preserving property. We perform extensive experiments to evaluate the performance of FedPC in terms of the approximation to the upper-bound performance (when training centrally) and communication overhead. The results show that FedPC maintains the performance approximation of the models within $8.5\%$ of the centrally-trained models when data is distributed to 10 computing nodes. FedPC also reduces the communication overhead by up to $42.20\%$ compared to existing works.
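    For context, a minimal sketch of the plain federated averaging step that privacy-aware frameworks such as FedPC refine (this is generic FedAvg, not the paper's method):

        import numpy as np

        def federated_average(client_weights, client_sizes):
            # client_weights: one list of numpy arrays per client (same shapes)
            # client_sizes: local sample counts, used as averaging weights
            total = float(sum(client_sizes))
            avg = [np.zeros_like(w) for w in client_weights[0]]
            for weights, n in zip(client_weights, client_sizes):
                for i, w in enumerate(weights):
                    avg[i] += (n / total) * w
            return avg  # broadcast back to clients; raw data never leaves a client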
    ROOM: Adversarial Machine Learning Attacks Under Real-Time Constraints. (arXiv:2201.01621v1 [cs.CR])
    (0 min) Advances in deep learning have enabled a wide range of promising applications. However, these systems are vulnerable to Adversarial Machine Learning (AML) attacks; adversarially crafted perturbations to their inputs could cause them to misclassify. Several state-of-the-art adversarial attacks have demonstrated that they can reliably fool classifiers making these attacks a significant threat. Adversarial attack generation algorithms focus primarily on creating successful examples while controlling the noise magnitude and distribution to make detection more difficult. The underlying assumption of these attacks is that the adversarial noise is generated offline, making their execution time a secondary consideration. However, recently, just-in-time adversarial attacks where an attacker opportunistically generates adversarial examples on the fly have been shown to be possible. This paper introduces a new problem: how do we generate adversarial noise under real-time constraints to support such real-time adversarial attacks? Understanding this problem improves our understanding of the threat these attacks pose to real-time systems and provides security evaluation benchmarks for future defenses. Therefore, we first conduct a run-time analysis of adversarial generation algorithms. Universal attacks produce a general attack offline, with no online overhead, and can be applied to any input; however, their success rate is limited because of their generality. In contrast, online algorithms, which work on a specific input, are computationally expensive, making them inappropriate for operation under time constraints. Thus, we propose ROOM, a novel Real-time Online-Offline attack construction Model where an offline component serves to warm up the online algorithm, making it possible to generate highly successful attacks under time constraints.
    Unlabeled sample compression schemes and corner peelings for ample and maximum classes. (arXiv:1812.02099v2 [cs.DM] UPDATED)
    (0 min) We examine connections between combinatorial notions that arise in machine learning and topological notions in cubical/simplicial geometry. These connections enable us to export results from geometry to machine learning. Our first main result is based on a geometric construction by Tracy Hall (2004) of a partial shelling of the cross-polytope which cannot be extended. We use it to derive a maximum class of VC dimension 3 that has no corners. This refutes several previous works in machine learning from the past 11 years. In particular, it implies that all previous constructions of optimal unlabeled sample compression schemes for maximum classes are erroneous. On the positive side we present a new construction of an unlabeled sample compression scheme for maximum classes. We leave open whether our unlabeled sample compression scheme extends to ample (a.k.a. lopsided or extremal) classes, which represent a natural and far-reaching generalization of maximum classes. Towards resolving this question, we provide a geometric characterization in terms of unique sink orientations of the 1-skeletons of associated cubical complexes.
    The Effect of Model Compression on Fairness in Facial Expression Recognition. (arXiv:2201.01709v1 [cs.CV])
    (0 min) Deep neural networks have proved hugely successful, achieving human-like performance on a variety of tasks. However, they are also computationally expensive, which has motivated the development of model compression techniques which reduce the resource consumption associated with deep learning models. Nevertheless, recent studies have suggested that model compression can have an adverse effect on algorithmic fairness, amplifying existing biases in machine learning models. With this project we aim to extend those studies to the context of facial expression recognition. To do that, we set up a neural network classifier to perform facial expression recognition and implement several model compression techniques on top of it. We then run experiments on two facial expression datasets, namely the Extended Cohn-Kanade Dataset (CK+DB) and the Real-World Affective Faces Database (RAF-DB), to examine the individual and combined effect that compression techniques have on the model size, accuracy and fairness. Our experimental results show that: (i) compression and quantisation achieve significant reduction in model size with minimal impact on overall accuracy for both CK+DB and RAF-DB; (ii) in terms of model accuracy, the classifier trained and tested on RAF-DB seems more robust to compression compared to the CK+DB; (iii) for RAF-DB, the different compression strategies do not seem to increase the gap in predictive performance across the sensitive attributes of gender, race and age, which is in contrast with the results on the CK+DB, where compression seems to amplify existing biases for gender. We analyse the results and discuss the potential reasons for our findings.
    Adversarial Feature Desensitization. (arXiv:2006.04621v3 [cs.LG] UPDATED)
    (0 min) Neural networks are known to be vulnerable to adversarial attacks -- slight but carefully constructed perturbations of the inputs which can drastically impair the network's performance. Many defense methods have been proposed for improving robustness of deep networks by training them on adversarially perturbed inputs. However, these models often remain vulnerable to new types of attacks not seen during training, and even to slightly stronger versions of previously seen attacks. In this work, we propose a novel approach to adversarial robustness, which builds upon the insights from the domain adaptation field. Our method, called Adversarial Feature Desensitization (AFD), aims at learning features that are invariant towards adversarial perturbations of the inputs. This is achieved through a game where we learn features that are both predictive and robust (insensitive to adversarial attacks), i.e. cannot be used to discriminate between natural and adversarial data. Empirical results on several benchmarks demonstrate the effectiveness of the proposed approach against a wide range of attack types and attack strengths. Our code is available at https://github.com/BashivanLab/afd.
    Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning. (arXiv:2009.14457v2 [cs.CL] UPDATED)
    (0 min) Recent approaches in literature have exploited the multi-modal information in documents (text, layout, image) to serve specific downstream document tasks. However, they are limited by their - (i) inability to learn cross-modal representations across text, layout and image dimensions for documents and (ii) inability to process multi-page documents. Pre-training techniques have been shown in Natural Language Processing (NLP) domain to learn generic textual representations from large unlabelled datasets, applicable to various downstream NLP tasks. In this paper, we propose a multi-task learning-based framework that utilizes a combination of self-supervised and supervised pre-training tasks to learn a generic document representation applicable to various downstream document tasks. Specifically, we introduce Document Topic Modelling and Document Shuffle Prediction as novel pre-training tasks to learn rich image representations along with the text and layout representations for documents. We utilize the Longformer network architecture as the backbone to encode the multi-modal information from multi-page documents in an end-to-end fashion. We showcase the applicability of our pre-training framework on a variety of different real-world document tasks such as document classification, document information extraction, and document retrieval. We evaluate our framework on different standard document datasets and conduct exhaustive experiments to compare performance against various ablations of our framework and state-of-the-art baselines.
    FeatureEnVi: Visual Analytics for Feature Engineering Using Stepwise Selection and Semi-Automatic Extraction Approaches. (arXiv:2103.14539v3 [cs.LG] UPDATED)
    (0 min) The machine learning (ML) life cycle involves a series of iterative steps, from the effective gathering and preparation of the data, including complex feature engineering processes, to the presentation and improvement of results, with various algorithms to choose from in every step. Feature engineering in particular can be very beneficial for ML, leading to numerous improvements such as boosting the predictive results, decreasing computational times, reducing excessive noise, and increasing the transparency behind the decisions taken during the training. Despite that, while several visual analytics tools exist to monitor and control the different stages of the ML life cycle (especially those related to data and algorithms), feature engineering support remains inadequate. In this paper, we present FeatureEnVi, a visual analytics system specifically designed to assist with the feature engineering process. Our proposed system helps users to choose the most important feature, to transform the original features into powerful alternatives, and to experiment with different feature generation combinations. Additionally, data space slicing allows users to explore the impact of features on both local and global scales. FeatureEnVi utilizes multiple automatic feature selection techniques; furthermore, it visually guides users with statistical evidence about the influence of each feature (or subsets of features). The final outcome is the extraction of heavily engineered features, evaluated by multiple validation metrics. The usefulness and applicability of FeatureEnVi are demonstrated with two use cases and a case study. We also report feedback from interviews with two ML experts and a visualization researcher who assessed the effectiveness of our system.
    The OARF Benchmark Suite: Characterization and Implications for Federated Learning Systems. (arXiv:2006.07856v3 [cs.LG] UPDATED)
    (0 min) This paper presents and characterizes an Open Application Repository for Federated Learning (OARF), a benchmark suite for federated machine learning systems. Previously available benchmarks for federated learning have focused mainly on synthetic datasets and use a limited number of applications. OARF mimics more realistic application scenarios with publicly available data sets as different data silos in image, text and structured data. Our characterization shows that the benchmark suite is diverse in data size, distribution, feature distribution and learning task complexity. We have developed reference implementations and evaluated important aspects of federated learning, including model accuracy, communication cost, throughput and convergence time; these extensive evaluations point to future research opportunities for federated learning systems. Through these evaluations, we discovered some interesting findings, such as that federated learning can effectively increase end-to-end throughput.
    Latent Space based Memory Replay for Continual Learning in Artificial Neural Networks. (arXiv:2111.13297v2 [cs.LG] UPDATED)
    (0 min) Memory replay may be key to learning in biological brains, which manage to learn new tasks continually without catastrophically interfering with previous knowledge. On the other hand, artificial neural networks suffer from catastrophic forgetting and tend to only perform well on tasks that they were recently trained on. In this work we explore the application of latent space based memory replay for classification using artificial neural networks. We are able to preserve good performance in previous tasks by storing only a small percentage of the original data in a compressed latent space version.
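    A hedged sketch of the generic recipe (latent replay in general, not the paper's exact setup): encode a small fraction of each task's data, keep only the compressed latents, and interleave them with batches from later tasks.

        import numpy as np

        rng = np.random.default_rng(0)
        buffer_z, buffer_y = [], []

        def remember(encode, x, y, keep_frac=0.05):
            # store a small, compressed sample of the current task
            keep = rng.random(len(x)) < keep_frac
            buffer_z.append(encode(x[keep]))
            buffer_y.append(y[keep])

        def replay_batch(batch_size=32):
            # draw stored latents to mix into new-task training of the head
            z = np.concatenate(buffer_z)
            y = np.concatenate(buffer_y)
            idx = rng.choice(len(z), size=batch_size)
            return z[idx], y[idx]  # train the classifier head on these too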
    Matrix Completion with Hierarchical Graph Side Information. (arXiv:2201.01728v1 [stat.ML])
    (2 min) We consider a matrix completion problem that exploits social or item similarity graphs as side information. We develop a universal, parameter-free, and computationally efficient algorithm that starts with hierarchical graph clustering and then iteratively refines estimates both on graph clustering and matrix ratings. Under a hierarchical stochastic block model that well respects practically-relevant social graphs and a low-rank rating matrix model (to be detailed), we demonstrate that our algorithm achieves the information-theoretic limit on the number of observed matrix entries (i.e., optimal sample complexity) that is derived by maximum likelihood estimation together with a lower-bound impossibility result. One consequence of this result is that exploiting the hierarchical structure of social graphs yields a substantial gain in sample complexity relative to the one that simply identifies different groups without resorting to the relational structure across them. We conduct extensive experiments both on synthetic and real-world datasets to corroborate our theoretical results as well as to demonstrate significant performance improvements over other matrix completion algorithms that leverage graph side information.
    Standard Vs Uniform Binary Search and Their Variants in Learned Static Indexing: The Case of the Searching on Sorted Data Benchmarking Software Platform. (arXiv:2201.01554v1 [cs.DS])
    (2 min) The Searching on Sorted Data (SOSD, in short) is a highly engineered software platform for benchmarking Learned Indexes, the latter being a novel and quite effective proposal for how to search a sorted table by combining Machine Learning techniques with classic Algorithms. In such a platform, and in the related benchmarking experiments, following a natural and intuitive choice, the final search stage is performed via the Standard (textbook) Binary Search procedure. However, recent studies, which do not use Machine Learning predictions, indicate that Uniform Binary Search, streamlined to avoid "branching" in the main loop, is superior in performance to its Standard counterpart when the table to be searched is relatively small, e.g., fitting in L1 or L2 cache. Analogous results hold for k-ary Search, even on large tables. One would expect an analogous behaviour within Learned Indexes. Via a set of extensive experiments, coherent with the State of the Art, we show that for Learned Indexes, and as far as the SOSD software is concerned, the use of the Standard routine (either Binary or k-ary Search) is superior to the Uniform one, across all the internal memory levels. This fact provides a quantitative justification of the natural choice made so far. Our experiments also indicate that Uniform Binary and k-ary Search can be advantageous to use in order to save space in Learned Indexes, while granting a good performance in time. Our findings are of methodological relevance for this novel and fast-growing area and informative to practitioners interested in using Learned Indexes in application domains, e.g., Databases and Search Engines.
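    For reference, hedged Python sketches of the two routines being compared; only the structure is shown, since the performance differences in the paper come from branch prediction and cache behavior at the machine level, which Python does not expose.

        def standard_binary_search(a, key):
            # textbook version: three-way branch in the loop body
            lo, hi = 0, len(a) - 1
            while lo <= hi:
                mid = (lo + hi) // 2
                if a[mid] == key:
                    return mid
                if a[mid] < key:
                    lo = mid + 1
                else:
                    hi = mid - 1
            return -1

        def uniform_binary_search(a, key):
            # "uniform" version: fixed power-of-two step sizes, one comparison
            # per iteration, which maps to branch-free code in C
            if not a:
                return -1
            pos, step = -1, 1 << (len(a).bit_length() - 1)
            while step:
                nxt = pos + step
                if nxt < len(a) and a[nxt] <= key:
                    pos = nxt
                step >>= 1
            return pos if pos >= 0 and a[pos] == key else -1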
    Fast Gradient Non-sign Methods. (arXiv:2110.12734v2 [cs.CV] UPDATED)
    (2 min) Adversarial attacks have succeeded in "fooling" DNNs, and among them gradient-based algorithms have become one of the mainstreams. Based on the linearity hypothesis [12], under an $\ell_\infty$ constraint, the sign operation applied to the gradients is a good choice for generating perturbations. However, this operation has a side effect, since it biases the direction of the perturbations away from the real gradients. In other words, current methods contain a gap between real gradients and actual noises, which leads to biased and inefficient attacks. Therefore in this paper, based on the Taylor expansion, the bias is analyzed theoretically and a correction of sign, i.e., the Fast Gradient Non-sign Method (FGNM), is proposed. Notably, FGNM is a general routine which can seamlessly replace the conventional sign operation in gradient-based attacks with negligible extra computational cost. Extensive experiments demonstrate the effectiveness of our methods. Specifically, ours outperform sign-based methods by 27.5% at most and 9.5% on average. Our anonymous code is publicly available at https://git.io/mm-fgnm.
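    A hedged numpy sketch of the contrast as we read it (the exact scaling in the paper may differ; here the non-sign step keeps the true gradient direction and rescales it so its magnitude is comparable to the sign step's):

        import numpy as np

        def fgsm_step(grad, eps):
            # conventional step: direction quantized to {-1, 0, +1} per coordinate
            return eps * np.sign(grad)

        def fgnm_step(grad, eps):
            # non-sign step: preserve the gradient direction, rescale its norm
            # to match the L2 norm the sign step would have had
            zeta = np.linalg.norm(np.sign(grad)) / (np.linalg.norm(grad) + 1e-12)
            return eps * zeta * grad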
    Sparse Super-Regular Networks. (arXiv:2201.01363v1 [cs.LG])
    (2 min) It has been argued by Thom and Palm that sparsely-connected neural networks (SCNs) show improved performance over fully-connected networks (FCNs). Super-regular networks (SRNs) are neural networks composed of a set of stacked sparse layers of (epsilon, delta)-super-regular pairs, and randomly permuted node order. Using the Blow-up Lemma, we prove that as a result of the individual super-regularity of each pair of layers, SRNs guarantee a number of properties that make them suitable replacements for FCNs for many tasks. These guarantees include edge uniformity across all large-enough subsets, minimum node in- and out-degree, input-output sensitivity, and the ability to embed pre-trained constructs. Indeed, SRNs have the capacity to act like FCNs, and eliminate the need for costly regularization schemes like Dropout. We show that SRNs perform similarly to X-Nets via readily reproducible experiments, and offer far greater guarantees and control over network structure.
    Deep Learning-Based Sparse Whole-Slide Image Analysis for the Diagnosis of Gastric Intestinal Metaplasia. (arXiv:2201.01449v1 [eess.IV])
    (2 min) In recent years, deep learning has successfully been applied to automate a wide variety of tasks in diagnostic histopathology. However, fast and reliable localization of small-scale regions-of-interest (ROI) has remained a key challenge, as discriminative morphologic features often occupy only a small fraction of a gigapixel-scale whole-slide image (WSI). In this paper, we propose a sparse WSI analysis method for the rapid identification of high-power ROI for WSI-level classification. We develop an evaluation framework inspired by the early classification literature, in order to quantify the tradeoff between diagnostic performance and inference time for sparse analytic approaches. We test our method on a common but time-consuming task in pathology - that of diagnosing gastric intestinal metaplasia (GIM) on hematoxylin and eosin (H&E)-stained slides from endoscopic biopsy specimens. GIM is a well-known precursor lesion along the pathway to development of gastric cancer. We performed a thorough evaluation of the performance and inference time of our approach on a test set of GIM-positive and GIM-negative WSI, finding that our method successfully detects GIM in all positive WSI, with a WSI-level classification area under the receiver operating characteristic curve (AUC) of 0.98 and an average precision (AP) of 0.95. Furthermore, we show that our method can attain these metrics in under one minute on a standard CPU. Our results are applicable toward the goal of developing neural networks that can easily be deployed in clinical settings to support pathologists in quickly localizing and diagnosing small-scale morphologic features in WSI.
    Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning. (arXiv:2106.04015v3 [cs.LG] UPDATED)
    (2 min) High-quality estimates of uncertainty and robustness are crucial for numerous real-world applications, especially for deep learning which underlies many deployed ML systems. The ability to compare techniques for improving these estimates is therefore very important for research and practice alike. Yet, competitive comparisons of methods are often lacking due to a range of reasons, including: compute availability for extensive tuning, incorporation of sufficiently many baselines, and concrete documentation for reproducibility. In this paper we introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks. As of this writing, the collection spans 19 methods across 9 tasks, each with at least 5 metrics. Each baseline is a self-contained experiment pipeline with easily reusable and extendable components. Our goal is to provide immediate starting points for experimentation with new methods or applications. Additionally we provide model checkpoints, experiment outputs as Python notebooks, and leaderboards for comparing results. Code available at https://github.com/google/uncertainty-baselines.
    Algorithmic Stability and Generalization of an Unsupervised Feature Selection Algorithm. (arXiv:2010.09416v2 [cs.LG] UPDATED)
    (2 min) Feature selection, as a vital dimension reduction technique, reduces data dimension by identifying an essential subset of input features, which can facilitate interpretable insights into learning and inference processes. Algorithmic stability is a key characteristic of an algorithm regarding its sensitivity to perturbations of input samples. In this paper, we propose an innovative unsupervised feature selection algorithm attaining this stability with provable guarantees. The architecture of our algorithm consists of a feature scorer and a feature selector. The scorer trains a neural network (NN) to globally score all the features, and the selector adopts a dependent sub-NN to locally evaluate the representation abilities for selecting features. Further, we present algorithmic stability analysis and show that our algorithm has a performance guarantee via a generalization error bound. Extensive experimental results on real-world datasets demonstrate superior generalization performance of our proposed algorithm to strong baseline methods. Also, the properties revealed by our theoretical analysis and the stability of our algorithm-selected features are empirically confirmed.
    Modeling Human Driver Interactions Using an Infinite Policy Space Through Gaussian Processes. (arXiv:2201.01733v1 [cs.LG])
    (2 min) This paper proposes a method for modeling human driver interactions that relies on multi-output Gaussian processes. The proposed method is developed as a refinement of the game theoretical hierarchical reasoning approach called "level-k reasoning", which conventionally assigns discrete levels of behaviors to agents. Although it is shown to be an effective modeling tool, the level-k reasoning approach may pose undesired constraints for predicting human decision making due to the limited number (usually 2 or 3) of driver policies it extracts. The proposed approach is put forward to fill this gap in the literature by introducing a continuous domain framework that enables an infinite policy space. By using the approach presented in this paper, more accurate driver models can be obtained, which can then be employed for creating high fidelity simulation platforms for the validation of autonomous vehicle control algorithms. The proposed method is validated on a real traffic dataset and compared with the conventional level-k approach to demonstrate its contributions and implications.
    Robust Self-Supervised Audio-Visual Speech Recognition. (arXiv:2201.01763v1 [cs.SD])
    (2 min) Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with the visual information that is invariant to noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup; hence the progress was hindered by the amount of labeled data available. In this work, we present a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model. On the largest available AVSR benchmark dataset LRS3, our approach outperforms prior state-of-the-art by ~50% (28.0% vs. 14.1%) using less than 10% of labeled data (433hr vs. 30hr) in the presence of babble noise, while reducing the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.
    Aspis: A Robust Detection System for Distributed Learning. (arXiv:2108.02416v2 [cs.LG] UPDATED)
    (0 min) State-of-the-art machine learning models are routinely trained on large-scale distributed clusters. Crucially, such systems can be compromised when some of the computing devices exhibit abnormal (Byzantine) behavior and return arbitrary results to the parameter server (PS). This behavior may be attributed to a plethora of reasons, including system failures and orchestrated attacks. Existing work suggests robust aggregation and/or computational redundancy to alleviate the effect of distorted gradients. However, most of these schemes are ineffective when an adversary knows the task assignment and can choose the attacked workers judiciously to induce maximal damage. Our proposed method Aspis assigns gradient computations to worker nodes using a subset-based assignment which allows for multiple consistency checks on the behavior of a worker node. Examination of the calculated gradients and post-processing (clique-finding in an appropriately constructed graph) by the central node allows for efficient detection and subsequent exclusion of adversaries from the training process. We prove the Byzantine resilience and detection guarantees of Aspis under weak and strong attacks and extensively evaluate the system on various large-scale training scenarios. The principal metric for our experiments is the test accuracy, for which we demonstrate a significant improvement of about 30% compared to many state-of-the-art approaches on the CIFAR-10 dataset. The corresponding reduction of the fraction of corrupted gradients ranges from 16% to 99%.
    Morphology Decoder: A Machine Learning Guided 3D Vision Quantifying Heterogenous Rock Permeability for Planetary Surveillance and Robotic Functions. (arXiv:2111.13460v2 [cs.CV] UPDATED)
    (0 min) Permeability has a dominant influence on the flow properties of a natural fluid. The Lattice Boltzmann simulator determines permeability from the nano- and micropore network, but it requires millions of flow-dynamics calculations, with accumulated errors and high consumption of computing power. To efficiently and consistently predict permeability, we propose a morphology decoder, a parallel and serial flow reconstruction of machine-learning-segmented heterogeneous Cretaceous texture from 3D micro-computerized tomography and nuclear magnetic resonance images. For 3D vision, we introduce controllable-measurable-volume as a new supervised segmentation, in which a unique set of voxel intensities corresponds to grain and pore throat sizes. The morphology decoder demarks and aggregates the morphology boundaries in a novel way to produce permeability. The morphology decoder method consists of five novel processes, described in this paper: (1) Geometrical 3D Permeability, (2) Machine Learning guided 3D Properties Recognition of Rock Morphology, (3) 3D Image Properties Integration Model for Permeability, (4) MRI Permeability Imager, and (5) Morphology Decoder (the process that integrates the other four novel processes).
    Conditional Imitation Learning for Multi-Agent Games. (arXiv:2201.01448v1 [cs.LG])
    (0 min) While advances in multi-agent learning have enabled the training of increasingly complex agents, most existing techniques produce a final policy that is not designed to adapt to a new partner's strategy. However, we would like our AI agents to adjust their strategy based on the strategies of those around them. In this work, we study the problem of conditional multi-agent imitation learning, where we have access to joint trajectory demonstrations at training time, and we must interact with and adapt to new partners at test time. This setting is challenging because we must infer a new partner's strategy and adapt our policy to that strategy, all without knowledge of the environment reward or dynamics. We formalize this problem of conditional multi-agent imitation learning, and propose a novel approach to address the difficulties of scalability and data scarcity. Our key insight is that variations across partners in multi-agent games are often highly structured, and can be represented via a low-rank subspace. Leveraging tools from tensor decomposition, our model learns a low-rank subspace over ego and partner agent strategies, then infers and adapts to a new partner strategy by interpolating in the subspace. We experiment with a mix of collaborative tasks, including bandits, particle, and Hanabi environments. Additionally, we test our conditional policies against real human partners in a user study on the Overcooked game. Our model adapts better to new partners compared to baselines, and robustly handles diverse settings ranging from discrete/continuous actions and static/online evaluation with AI/human partners.
    The Neural Coding Framework for Learning Generative Models. (arXiv:2012.03405v4 [cs.LG] UPDATED)
    (0 min) Neural generative models can be used to learn complex probability distributions from data, to sample from them, and to produce probability density estimates. We propose a computational framework for developing neural generative models inspired by the theory of predictive processing in the brain. According to predictive processing theory, the neurons in the brain form a hierarchy in which neurons in one level form expectations about sensory inputs from another level. These neurons update their local models based on differences between their expectations and the observed signals. In a similar way, artificial neurons in our generative models predict what neighboring neurons will do, and adjust their parameters based on how well the predictions matched reality. In this work, we show that the neural generative models learned within our framework perform well in practice across several benchmark datasets and metrics and either remain competitive with or significantly outperform other generative models with similar functionality (such as the variational auto-encoder).
    Challenges of Artificial Intelligence -- From Machine Learning and Computer Vision to Emotional Intelligence. (arXiv:2201.01466v1 [cs.AI])
    (0 min) Artificial intelligence (AI) has become a part of everyday conversation and our lives. It is considered as the new electricity that is revolutionizing the world. AI is heavily invested in by both industry and academia. However, there is also a lot of hype in the current AI debate. AI based on so-called deep learning has achieved impressive results in many problems, but its limits are already visible. AI has been under research since the 1940s, and the industry has seen many ups and downs due to over-expectations and the related disappointments that have followed. The purpose of this book is to give a realistic picture of AI, its history, its potential and limitations. We believe that AI is a helper, not a ruler of humans. We begin by describing what AI is and how it has evolved over the decades. After the fundamentals, we explain the importance of massive data for the current mainstream of artificial intelligence. The most common representations for AI, methods, and machine learning are covered. In addition, the main application areas are introduced. Computer vision has been central to the development of AI. The book provides a general introduction to computer vision, and includes an exposure to the results and applications of our own research. Emotions are central to human intelligence, but little use has been made of them in AI. We present the basics of emotional intelligence and our own research on the topic. We discuss super-intelligence that transcends human understanding, explaining why such an achievement seems impossible on the basis of present knowledge, and how AI could be improved. Finally, a summary is made of the current state of AI and what to do in the future. In the appendix, we look at the development of AI education, especially from the perspective of contents at our own university.
    Supervised Permutation Invariant Networks for Solving the CVRP with Bounded Fleet Size. (arXiv:2201.01529v1 [cs.LG])
    (0 min) Learning to solve combinatorial optimization problems, such as the vehicle routing problem, offers great computational advantages over classical operations research solvers and heuristics. The recently developed deep reinforcement learning approaches either improve an initially given solution iteratively or sequentially construct a set of individual tours. However, most of the existing learning-based approaches are not able to work with a fixed number of vehicles and thus bypass the complex assignment problem of the customers onto an a priori given number of available vehicles. On the other hand, this makes them less suitable for real applications, as many logistic service providers rely on solutions provided for a specific bounded fleet size and cannot accommodate short-term changes to the number of vehicles. In contrast, we propose a powerful supervised deep learning framework that constructs a complete tour plan from scratch while respecting an a priori fixed number of available vehicles. In combination with an efficient post-processing scheme, our supervised approach is not only much faster and easier to train but also achieves competitive results that incorporate the practical aspect of vehicle costs. In thorough controlled experiments we compare our method to multiple state-of-the-art approaches, where we demonstrate stable performance while utilizing fewer vehicles, and shed some light on existing inconsistencies in the experimentation protocols of the related work.
    Relationship extraction for knowledge graph creation from biomedical literature. (arXiv:2201.01647v1 [cs.AI])
    (0 min) Biomedical research is growing at such an exponential pace that scientists, researchers and practitioners are no longer able to cope with the amount of published literature in the domain. The knowledge presented in the literature needs to be systematized in such a way that claims and hypotheses can be easily found, accessed and validated. Knowledge graphs can provide such a framework for semantic knowledge representation from literature. However, in order to build a knowledge graph, it is necessary to extract knowledge in the form of relationships between biomedical entities and to normalize both entities and relationship types. In this paper, we present and compare a few rule-based and machine learning-based (Naive Bayes and Random Forests as examples of traditional machine learning methods, and a T5-based model as an example of modern deep learning) methods for scalable relationship extraction from biomedical literature, for integration into knowledge graphs. We examine how resilient these various methods are to unbalanced and fairly small datasets, showing that the T5 model handles small datasets well, due to its pre-training on the large C4 dataset, as well as unbalanced data. The best performing model was the T5 model fine-tuned on balanced data, with a reported F1-score of 0.88.
    Stein Latent Optimization for Generative Adversarial Networks. (arXiv:2106.05319v4 [cs.LG] UPDATED)
    (0 min) Generative adversarial networks (GANs) with clustered latent spaces can perform conditional generation in a completely unsupervised manner. In the real world, the salient attributes of unlabeled data can be imbalanced. However, most of existing unsupervised conditional GANs cannot cluster attributes of these data in their latent spaces properly because they assume uniform distributions of the attributes. To address this problem, we theoretically derive Stein latent optimization that provides reparameterizable gradient estimations of the latent distribution parameters assuming a Gaussian mixture prior in a continuous latent space. Structurally, we introduce an encoder network and novel unsupervised conditional contrastive loss to ensure that data generated from a single mixture component represent a single attribute. We confirm that the proposed method, named Stein Latent Optimization for GANs (SLOGAN), successfully learns balanced or imbalanced attributes and achieves state-of-the-art unsupervised conditional generation performance even in the absence of attribute information (e.g., the imbalance ratio). Moreover, we demonstrate that the attributes to be learned can be manipulated using a small amount of probe data.
    Inferring dark matter substructure with astrometric lensing beyond the power spectrum. (arXiv:2110.01620v2 [astro-ph.CO] UPDATED)
    (0 min) Astrometry -- the precise measurement of positions and motions of celestial objects -- has emerged as a promising avenue for characterizing the dark matter population in our Galaxy. By leveraging recent advances in simulation-based inference and neural network architectures, we introduce a novel method to search for global dark matter-induced gravitational lensing signatures in astrometric datasets. Our method based on neural likelihood-ratio estimation shows significantly enhanced sensitivity to a cold dark matter population and more favorable scaling with measurement noise compared to existing approaches based on two-point correlation statistics. We demonstrate the real-world viability of our method by showing it to be robust to non-trivial modeled as well as unmodeled noise features expected in astrometric measurements. This establishes machine learning as a powerful tool for characterizing dark matter using astrometric data.
    Probing TryOnGAN. (arXiv:2201.01703v1 [cs.CV])
    (0 min) TryOnGAN is a recent virtual try-on approach which generates highly realistic images and outperforms most previous approaches. In this article, we reproduce the TryOnGAN implementation and probe it from diverse angles: the impact of transfer learning, variants of conditioning image generation with poses, and properties of latent space interpolation. Some of these facets have never been explored in the literature before. We find that transfer helps training initially but the gains are lost as models train longer, and that pose conditioning via concatenation performs better. The latent space self-disentangles the pose and the style features and enables style transfer across poses. Our code and models are available in open source.
    exploRNN: Understanding Recurrent Neural Networks through Visual Exploration. (arXiv:2012.06326v2 [cs.LG] UPDATED)
    (0 min) Due to the success of deep learning (DL) and its growing job market, students and researchers from many areas are interested in learning about DL technologies. Visualization has proven to be of great help during this learning process. While most current educational visualizations are targeted towards one specific architecture or use case, recurrent neural networks (RNNs), which are capable of processing sequential data, are not covered yet. This is despite the fact that tasks on sequential data, such as text and function analysis, are at the forefront of DL research. Therefore, we propose exploRNN, the first interactively explorable educational visualization for RNNs. On the basis of making learning easier and more fun, we define educational objectives targeted towards understanding RNNs. We use these objectives to form guidelines for the visual design process. By means of exploRNN, which is accessible online, we provide an overview of the training process of RNNs at a coarse level, while also allowing a detailed inspection of the data flow within LSTM cells. In an empirical study, we assessed 37 subjects in a between-subjects design to investigate the learning outcomes and cognitive load of exploRNN compared to a classic text-based learning environment. While learners in the text group are ahead in superficial knowledge acquisition, exploRNN is particularly helpful for deeper understanding of the learning content. In addition, the complex content in exploRNN is perceived as significantly easier and causes less extraneous load than in the text group. The study shows that for difficult learning material such as recurrent networks, where deep understanding is important, interactive visualizations such as exploRNN can be helpful.
    How Does Knowledge Graph Embedding Extrapolate to Unseen Data: a Semantic Evidence View. (arXiv:2109.11800v2 [cs.CL] UPDATED)
    (0 min) Knowledge Graph Embedding (KGE) aims to learn representations for entities and relations. Most KGE models have achieved great success, especially on extrapolation scenarios. Specifically, given an unseen triple (h, r, t), a trained model can still correctly predict t from (h, r, ?), or h from (?, r, t); such extrapolation ability is impressive. However, most existing KGE works focus on the design of delicate triple modeling functions, which mainly tell us how to measure the plausibility of observed triples, but offer limited explanation of why the methods can extrapolate to unseen data and what the important factors are that help KGE extrapolate. Therefore in this work, we attempt to study KGE extrapolation through two problems: 1. How does KGE extrapolate to unseen data? 2. How can we design a KGE model with better extrapolation ability? For problem 1, we first discuss the impact factors for extrapolation and, at the relation, entity and triple levels respectively, propose three Semantic Evidences (SEs), which can be observed from the train set and provide important semantic information for extrapolation. Then we verify the effectiveness of the SEs through extensive experiments on several typical KGE methods. For problem 2, to make better use of the three levels of SE, we propose a novel GNN-based KGE model, called Semantic Evidence aware Graph Neural Network (SE-GNN). In SE-GNN, each level of SE is modeled explicitly by the corresponding neighbor pattern and merged sufficiently by multi-layer aggregation, which contributes to obtaining more extrapolative knowledge representations. Finally, through extensive experiments on the FB15k-237 and WN18RR datasets, we show that SE-GNN achieves state-of-the-art performance on the Knowledge Graph Completion task and exhibits better extrapolation ability.
    Approximate Spectral Decomposition of Fisher Information Matrix for Simple ReLU Networks. (arXiv:2111.15256v2 [cs.LG] UPDATED)
    (0 min) We analyze the Fisher information matrix (FIM) of one-hidden-layer networks with the ReLU activation function. Let $W$ denote the $d \times p$ weight matrix from the $d$-dimensional input to the hidden layer consisting of $p$ neurons, and $v$ the $p$-dimensional weight vector from the hidden layer to the scalar output. We focus on the FIM of $v$, which we denote as $I$. When $p$ is large, under certain conditions, the following approximately holds. 1) There are three major clusters in the eigenvalue distribution. 2) Since $I$ is non-negative owing to the ReLU, the first eigenvalue is the Perron-Frobenius eigenvalue. 3) For the cluster of the next maximum values, the eigenspace is spanned by the row vectors of $W$. 4) The direct sum of the eigenspace of the first eigenvalue and that of the third cluster is spanned by the set of all the vectors obtained as the Hadamard product of any pair of the row vectors of $W$. We confirmed by numerical simulation that the above is approximately correct when the number of hidden nodes is about 10000.
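    For concreteness (our gloss, under the standard Gaussian-output regression model; the paper's exact setup may differ): for $f(x) = v^\top \mathrm{ReLU}(Wx)$, the FIM of $v$ reduces to $I = \mathbb{E}_x[\mathrm{ReLU}(Wx)\,\mathrm{ReLU}(Wx)^\top]$. Every entry of $I$ is then non-negative because ReLU outputs are non-negative, which is the sense in which the Perron-Frobenius theorem applies to the leading eigenvalue.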
    Debiased Learning from Naturally Imbalanced Pseudo-Labels for Zero-Shot and Semi-Supervised Learning. (arXiv:2201.01490v1 [cs.LG])
    (0 min) This work studies the bias issue of pseudo-labeling, a natural phenomenon that widely occurs but is often overlooked by prior research. Pseudo-labels are generated when a classifier trained on source data is transferred to unlabeled target data. We observe heavy long-tailed pseudo-labels when the semi-supervised learning model FixMatch predicts labels on the unlabeled set, even though the unlabeled data is curated to be balanced. Without intervention, the training model inherits the bias from the pseudo-labels and ends up being sub-optimal. To eliminate the model bias, we propose a simple yet effective method, DebiasMatch, comprising an adaptive debiasing module and an adaptive marginal loss. The strength of debiasing and the size of margins can be automatically adjusted by making use of an online updated queue. Benchmarked on ImageNet-1K, DebiasMatch significantly outperforms previous state-of-the-arts by more than 26% and 8.7% on semi-supervised learning (0.2% annotated data) and zero-shot learning tasks respectively.
    Fractal Autoencoders for Feature Selection. (arXiv:2010.09430v2 [cs.LG] UPDATED)
    (0 min) Feature selection reduces the dimensionality of data by identifying a subset of the most informative features. In this paper, we propose an innovative framework for unsupervised feature selection, called fractal autoencoders (FAE). It trains a neural network to pinpoint informative features for global exploring of representability and for local excavating of diversity. Architecturally, FAE extends autoencoders by adding a one-to-one scoring layer and a small sub-neural network for feature selection in an unsupervised fashion. With such a concise architecture, FAE achieves state-of-the-art performances; extensive experimental results on fourteen datasets, including very high-dimensional data, have demonstrated the superiority of FAE over existing contemporary methods for unsupervised feature selection. In particular, FAE exhibits substantial advantages on gene expression data exploration, reducing measurement cost by about $15$\% over the widely used L1000 landmark genes. Further, we show that the FAE framework is easily extensible with an application.
    Dynamic GPU Energy Optimization for Machine Learning Training Workloads. (arXiv:2201.01684v1 [cs.DC])
    (0 min) GPUs are widely used to accelerate the training of machine learning workloads. As modern machine learning models become increasingly larger, they require a longer time to train, leading to higher GPU energy consumption. This paper presents GPOEO, an online GPU energy optimization framework for machine learning training workloads. GPOEO dynamically determines the optimal energy configuration by employing novel techniques for online measurement, multi-objective prediction modeling, and search optimization. To characterize the target workload behavior, GPOEO utilizes GPU performance counters. To reduce the performance counter profiling overhead, it uses an analytical model to detect training iteration changes and only collects performance counter data when an iteration shift is detected. GPOEO employs multi-objective models based on gradient boosting and a local search algorithm to find a trade-off between execution time and energy consumption. We evaluate GPOEO by applying it to 71 machine learning workloads from two AI benchmark suites running on an NVIDIA RTX3080Ti GPU. Compared with the NVIDIA default scheduling strategy, GPOEO delivers a mean energy saving of 16.2% with a modest average execution time increase of 5.1%.
    Optimal Transport using GANs for Lineage Tracing. (arXiv:2007.12098v3 [cs.LG] UPDATED)
    (2 min) In this paper, we present Super-OT, a novel approach to computational lineage tracing that combines a supervised learning framework with optimal transport based on Generative Adversarial Networks (GANs). Unlike previous approaches to lineage tracing, Super-OT has the flexibility to integrate paired data. We benchmark Super-OT based on single-cell RNA-seq data against Waddington-OT, a popular approach for lineage tracing that also employs optimal transport. We show that Super-OT achieves gains over Waddington-OT in predicting the class outcome of cells during differentiation, since it allows the integration of additional information during training.
    Sample Efficient Deep Reinforcement Learning via Uncertainty Estimation. (arXiv:2201.01666v1 [cs.LG])
    (2 min) In model-free deep reinforcement learning (RL) algorithms, using noisy value estimates to supervise policy evaluation and optimization is detrimental to sample efficiency. As this noise is heteroscedastic, its effects can be mitigated using uncertainty-based weights in the optimization process. Previous methods rely on sampled ensembles, which do not capture all aspects of uncertainty. We provide a systematic analysis of the sources of uncertainty in the noisy supervision that occurs in RL, and introduce inverse-variance RL, a Bayesian framework which combines probabilistic ensembles and Batch Inverse Variance weighting. We propose a method whereby two complementary uncertainty estimation methods account for both the Q-value and the environment stochasticity to better mitigate the negative impacts of noisy supervision. Our results show significant improvement in terms of sample efficiency on discrete and continuous control tasks.
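    The weighting scheme at the heart of such methods can be sketched in a few lines: given per-sample target variances from an ensemble, weight each squared error by the inverse variance. The epsilon floor and batch normalization of the weights below are illustrative choices, not the paper's exact formulation.

        import torch

        def inverse_variance_loss(q_pred, target_mean, target_var, eps=1e-2):
            weights = 1.0 / (target_var + eps)   # low-variance targets count more
            weights = weights / weights.sum()    # normalize over the batch
            return (weights * (q_pred - target_mean) ** 2).sum()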
    Asymptotics of $\ell_2$ Regularized Network Embeddings. (arXiv:2201.01689v1 [stat.ML])
    (2 min) A common approach to solving tasks, such as node classification or link prediction, on a large network begins by learning a Euclidean embedding of the nodes of the network, from which regular machine learning methods can be applied. For unsupervised random walk methods such as DeepWalk and node2vec, adding an $\ell_2$ penalty on the embedding vectors to the loss leads to improved downstream task performance. In this paper we study the effects of this regularization and prove that, under exchangeability assumptions on the graph, it asymptotically leads to learning a nuclear-norm-type penalized graphon. In particular, the exact form of the penalty depends on the choice of subsampling method used within stochastic gradient descent to learn the embeddings. We also illustrate empirically that concatenating node covariates to $\ell_2$ regularized node2vec embeddings leads to comparable, if not superior, performance to methods which incorporate node covariates and the network structure in a non-linear manner.
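    In practice, the $\ell_2$ penalty studied here corresponds to plain weight decay on the embedding table. A minimal sketch, with arbitrary sizes, learning rate, and penalty strength; note that SGD's weight_decay adds wd times the parameter to the gradient, i.e. a penalty of (wd/2) times the squared norm, hence the factor of two:

        import torch

        # l2 penalty lam * ||emb||^2 as weight decay during skip-gram training;
        # n_nodes, dim, lr, and lam are placeholder values
        n_nodes, dim, lam = 1000, 128, 1e-4
        emb = torch.nn.Embedding(n_nodes, dim)
        opt = torch.optim.SGD(emb.parameters(), lr=0.025, weight_decay=2 * lam)
        # each opt.step() now minimizes loss + lam * ||emb||^2 alongside the usual
        # node2vec/DeepWalk negative-sampling objective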
    Approximation capabilities of measure-preserving neural networks. (arXiv:2106.10911v2 [cs.LG] UPDATED)
    (2 min) Measure-preserving neural networks are well-developed invertible models; however, their approximation capabilities remain unexplored. This paper rigorously analyses the approximation capabilities of existing measure-preserving neural networks, including NICE and RevNets. It is shown that for compact $U \subset \mathbb{R}^D$ with $D\geq 2$, measure-preserving neural networks are able to approximate, in the $L^p$-norm, any measure-preserving map $\psi: U \to \mathbb{R}^D$ which is bounded and injective. In particular, any continuously differentiable injective map whose Jacobian determinant is $\pm 1$ is measure-preserving and can thus be approximated.
    Interpretable machine learning for high-dimensional trajectories of aging health. (arXiv:2105.03410v2 [q-bio.QM] UPDATED)
    (2 min) We have built a computational model for individual aging trajectories of health and survival, which contains physical, functional, and biological variables, and is conditioned on demographic, lifestyle, and medical background information. We combine techniques of modern machine learning with an interpretable interaction network, where health variables are coupled by explicit pair-wise interactions within a stochastic dynamical system. Our dynamic joint interpretable network (DJIN) model is scalable to large longitudinal data sets, is predictive of individual high-dimensional health trajectories and survival from baseline health states, and infers an interpretable network of directed interactions between the health variables. The network identifies plausible physiological connections between health variables as well as clusters of strongly connected health variables. We use English Longitudinal Study of Aging (ELSA) data to train our model and show that it performs better than multiple dedicated linear models for health outcomes and survival. We compare our model with flexible lower-dimensional latent-space models to explore the dimensionality required to accurately model aging health outcomes. Our DJIN model can be used to generate synthetic individuals that age realistically, to impute missing data, and to simulate future aging outcomes given arbitrary initial health states.
    Deep Reinforcement Learning at the Edge of the Statistical Precipice. (arXiv:2108.13264v4 [cs.LG] UPDATED)
    (3 min) Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally demanding benchmarks has led to the practice of evaluating only a small number of runs per task, exacerbating the statistical uncertainty in point estimates. In this paper, we argue that reliable evaluation in the few-run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone versus a more thorough statistical analysis. With the aim of increasing the field's confidence in reported results with a handful of runs, we advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results, as well as present more robust and efficient aggregate metrics, such as interquartile mean scores, to achieve small uncertainty in results. Using such statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used RL benchmarks including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. Our findings call for a change in how we evaluate performance in deep RL, for which we present a more rigorous evaluation methodology, accompanied by an open-source library, rliable, to prevent unreliable results from stagnating the field.
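    As an illustration of one of the advocated metrics, the interquartile mean (IQM) is simply a 25% trimmed mean over runs, and a percentile bootstrap gives the interval estimate. The sketch below uses scipy rather than the rliable library itself; run counts and resample counts are arbitrary.

        import numpy as np
        from scipy.stats import trim_mean

        scores = np.random.rand(10)                    # e.g., 10 runs of one algorithm
        iqm = trim_mean(scores, proportiontocut=0.25)  # mean of the middle 50% of runs

        # percentile bootstrap for a 95% interval estimate of the IQM
        boot = [trim_mean(np.random.choice(scores, size=len(scores), replace=True), 0.25)
                for _ in range(2000)]
        low, high = np.percentile(boot, [2.5, 97.5])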
    Constrained Optimization to Train Neural Networks on Critical and Under-Represented Classes. (arXiv:2102.12894v4 [cs.LG] UPDATED)
    (3 min) Deep neural networks (DNNs) are notorious for making more mistakes on classes that have substantially fewer samples than the others during training. Such class imbalance is ubiquitous in clinical applications and crucial to handle because the classes with fewer samples most often correspond to critical cases (e.g., cancer) where misclassifications can have severe consequences. To avoid missing such cases, binary classifiers need to be operated at high True Positive Rates (TPRs) by setting a higher threshold, but this comes at the cost of very high False Positive Rates (FPRs) for problems with class imbalance. Existing methods for learning under class imbalance most often do not take this into account. We argue that prediction accuracy should be improved by emphasizing the reduction of FPRs at high TPRs for problems where misclassification of the positive, i.e. critical, class samples is associated with a higher cost. To this end, we pose the training of a DNN for binary classification as a constrained optimization problem and introduce a novel constraint that can be used with existing loss functions to enforce maximal area under the ROC curve (AUC) by prioritizing FPR reduction at high TPR. We solve the resulting constrained optimization problem using an Augmented Lagrangian method (ALM). Going beyond binary classification, we also propose two possible extensions of the proposed constraint for multi-class classification problems. We present experimental results for image-based binary and multi-class classification applications using an in-house medical imaging dataset, CIFAR10, and CIFAR100. Our results demonstrate that the proposed method improves on the baselines in the majority of cases by attaining higher accuracy on critical classes while reducing the misclassification rate for the non-critical class samples.
    Astraea: Grammar-based Fairness Testing. (arXiv:2010.02542v4 [cs.SE] UPDATED)
    (2 min) Software often produces biased outputs. In particular, machine learning (ML) based software is known to produce erroneous predictions when processing discriminatory inputs. Such unfair program behavior can be caused by societal bias. In the last few years, Amazon, Microsoft and Google have provided software services that produce unfair outputs, mostly due to societal bias (e.g. gender or race). In such events, developers are saddled with the task of conducting fairness testing. Fairness testing is challenging; developers are tasked with generating discriminatory inputs that reveal and explain biases. We propose a grammar-based fairness testing approach (called ASTRAEA) which leverages context-free grammars to generate discriminatory inputs that reveal fairness violations in software systems. Using probabilistic grammars, ASTRAEA also provides fault diagnosis by isolating the cause of observed software bias. ASTRAEA's diagnoses facilitate the improvement of ML fairness. ASTRAEA was evaluated on 18 software systems that provide three major natural language processing (NLP) services. In our evaluation, ASTRAEA generated fairness violations at a rate of ~18%. ASTRAEA generated over 573K discriminatory test cases and found over 102K fairness violations. Furthermore, ASTRAEA improves software fairness by ~76%, via model retraining.
    Exemplar-free Class Incremental Learning via Discriminative and Comparable One-class Classifiers. (arXiv:2201.01488v1 [cs.LG])
    (2 min) Exemplar-free class incremental learning requires classification models to learn new class knowledge incrementally without retaining any old samples. Recently, the framework based on parallel one-class classifiers (POC), which trains a one-class classifier (OCC) independently for each category, has attracted extensive attention, since it can naturally avoid catastrophic forgetting. POC, however, suffers from weak discriminability and comparability due to its independent training strategy for different OCCs. To meet this challenge, we propose a new framework, named Discriminative and Comparable One-class classifiers for Incremental Learning (DisCOIL). DisCOIL follows the basic principle of POC, but it adopts variational auto-encoders (VAE) instead of other well-established one-class classifiers (e.g., deep SVDD), because a trained VAE can not only identify the probability of an input sample belonging to a class but also generate pseudo samples of the class to assist in learning new tasks. With this advantage, DisCOIL trains a new-class VAE in contrast with the old-class VAEs, which forces the new-class VAE to reconstruct better for new-class samples but worse for the old-class pseudo samples, thus enhancing comparability. Furthermore, DisCOIL introduces a hinge reconstruction loss to ensure discriminability. We evaluate our method extensively on MNIST, CIFAR10, and Tiny-ImageNet. The experimental results show that DisCOIL achieves state-of-the-art performance.
    Reproducible and Portable Big Data Analytics in the Cloud. (arXiv:2112.09762v1 [cs.DC] CROSS LISTED)
    (2 min) Cloud computing has become a major approach to enable reproducible computational experiments because of its support of on-demand hardware and software resource provisioning. Yet there are still two main difficulties in reproducing big data applications in the cloud. The first is how to automate end-to-end execution of big data analytics in the cloud, including virtual distributed environment provisioning, network and security group setup, and big data analytics pipeline description and execution. The second is that an application developed for one cloud, such as AWS or Azure, is difficult to reproduce in another cloud, the so-called vendor lock-in problem. To tackle these problems, we leverage serverless computing and containerization techniques for automatic scalable big data application execution and reproducibility, and utilize the adapter design pattern to enable application portability and reproducibility across different clouds. Based on this approach, we propose and develop an open-source toolkit that supports 1) on-demand distributed hardware and software environment provisioning, 2) automatic data and configuration storage for each execution, 3) flexible client modes based on user preferences, 4) execution history query, and 5) simple reproducibility of existing executions in the same environment or a different environment. We ran extensive experiments on both AWS and Azure using three big data analytics applications that run on a virtual CPU/GPU cluster. Three main behaviors of our toolkit were benchmarked: i) execution overhead ratio for reproducibility support, ii) differences of reproducing the same application on AWS and Azure in terms of execution time, budgetary cost and cost-performance ratio, iii) differences between scale-out and scale-up approaches for the same application on AWS and Azure.
    Towards Similarity-Aware Time-Series Classification. (arXiv:2201.01413v1 [cs.LG])
    (0 min) We study time-series classification (TSC), a fundamental task of time-series data mining. Prior work has approached TSC from two major directions: (1) similarity-based methods that classify time-series based on the nearest neighbors, and (2) deep learning models that directly learn the representations for classification in a data-driven manner. Motivated by the different working mechanisms within these two research lines, we aim to connect them in such a way as to jointly model time-series similarities and learn the representations. This is a challenging task because it is unclear how we should efficiently leverage similarity information. To tackle the challenge, we propose Similarity-Aware Time-Series Classification (SimTSC), a conceptually simple and general framework that models similarity information with graph neural networks (GNNs). Specifically, we formulate TSC as a node classification problem in graphs, where the nodes correspond to time-series, and the links correspond to pair-wise similarities. We further design a graph construction strategy and a batch training algorithm with negative sampling to improve training efficiency. We instantiate SimTSC with ResNet as the backbone and Dynamic Time Warping (DTW) as the similarity measure. Extensive experiments on the full UCR datasets and several multivariate datasets demonstrate the effectiveness of incorporating similarity information into deep learning models in both supervised and semi-supervised settings. Our code is available at https://github.com/daochenzha/SimTSC
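    The graph-construction step can be pictured with a textbook DTW and a k-nearest-neighbor rule: compute pairwise DTW distances between series and link each series to its k closest neighbors. This is a simplified sketch, not the paper's exact pipeline (which also batches and uses negative sampling):

        import numpy as np

        def dtw(a, b):
            # classic O(n*m) dynamic time warping distance between two 1D series
            n, m = len(a), len(b)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = abs(a[i - 1] - b[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]

        def knn_graph(series, k=3):
            # nodes are time-series, links are the k smallest pairwise DTW distances
            n = len(series)
            dist = np.array([[dtw(s, t) for t in series] for s in series])
            np.fill_diagonal(dist, np.inf)    # no self-loops
            edges = [(i, j) for i in range(n) for j in np.argsort(dist[i])[:k]]
            return edges                       # feed these links to a GNN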
    AdaGDA: Faster Adaptive Gradient Descent Ascent Methods for Minimax Optimization. (arXiv:2106.16101v4 [math.OC] UPDATED)
    (2 min) In this paper, we propose a class of faster adaptive Gradient Descent Ascent (GDA) methods for solving nonconvex-strongly-concave minimax problems based on unified adaptive matrices, which include almost all existing coordinate-wise and global adaptive learning rates. Specifically, we propose a fast Adaptive Gradient Descent Ascent (AdaGDA) method based on the basic momentum technique, which reaches a lower gradient complexity of $O(\kappa^4\epsilon^{-4})$ for finding an $\epsilon$-stationary point without large batches, improving on the existing adaptive GDA methods by a factor of $O(\sqrt{\kappa})$. At the same time, we present an accelerated version of the AdaGDA (VR-AdaGDA) method based on the momentum-based variance reduction technique, which achieves a lower gradient complexity of $O(\kappa^{4.5}\epsilon^{-3})$ for finding an $\epsilon$-stationary point without large batches, improving on the existing adaptive GDA methods by a factor of $O(\epsilon^{-1})$. Moreover, we prove that our VR-AdaGDA method reaches the best known gradient complexity of $O(\kappa^{3}\epsilon^{-3})$ with mini-batch size $O(\kappa^3)$. In particular, we provide an effective convergence analysis framework for our adaptive GDA methods. Some experimental results on policy evaluation and fair classifier tasks demonstrate the efficiency of our algorithms.
    Mining Adverse Drug Reactions from Unstructured Mediums at Scale. (arXiv:2201.01405v1 [cs.CL])
    (2 min) Adverse drug reactions/events (ADR/ADE) have a major impact on patient health and health care costs. Detecting ADRs as early as possible and sharing them with regulators, pharma companies, and healthcare providers can prevent morbidity and save many lives. While most ADRs are not reported via formal channels, they are often documented in a variety of unstructured conversations such as social media posts by patients, customer support call transcripts, or CRM notes of meetings between healthcare providers and pharma sales reps. In this paper, we propose a natural language processing (NLP) solution that detects ADRs in such unstructured free-text conversations, which improves on previous work in three ways. First, a new Named Entity Recognition (NER) model obtains new state-of-the-art accuracy for ADR and Drug entity extraction on the ADE, CADEC, and SMM4H benchmark datasets (91.75%, 78.76%, and 83.41% F1 scores respectively). Second, two new Relation Extraction (RE) models are introduced - one based on BioBERT, the other utilizing crafted features over a Fully Connected Neural Network (FCNN) - which are shown to perform on par with existing state-of-the-art models, and outperform them when trained with a supplementary clinician-annotated RE dataset. Third, a new text classification model, for deciding if a conversation includes an ADR, obtains new state-of-the-art accuracy on the CADEC dataset (86.69% F1 score). The complete solution is implemented as a unified NLP pipeline in a production-grade library built on top of Apache Spark, making it natively scalable and able to process millions of batch or streaming records on commodity clusters.
    Boosting Algorithms for Delivery Time Prediction in Transportation Logistics. (arXiv:2009.11598v2 [cs.LG] UPDATED)
    (0 min) Travel time is a crucial measure in transportation. Accurate travel time prediction is also fundamental for operation and advanced information systems. A variety of solutions exist for short-term travel time predictions, such as solutions that utilize real-time GPS data and optimization methods to track the path of a vehicle. However, reliable long-term predictions remain challenging. We show in this paper the applicability and usefulness of travel time (i.e., delivery time) prediction for postal services. We investigate several methods, such as linear regression models and tree-based ensembles (random forest, bagging, and boosting), to predict delivery time, conducting extensive experiments and considering many usability scenarios. Results reveal that travel time prediction can help mitigate high delays in postal services. We show that some boosting algorithms, such as light gradient boosting and catboost, have higher performance in terms of accuracy and runtime efficiency than other baselines such as linear regression models, bagging regressor and random forest.
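    For readers unfamiliar with the setup, a boosted delivery-time regressor fits in a few lines of scikit-learn. The synthetic features below merely stand in for real route and parcel attributes, which are not published with the paper:

        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor
        from sklearn.model_selection import train_test_split

        # toy data: columns stand in for distance, load, hour of day, etc.
        X = np.random.rand(1000, 5)
        y = X @ np.array([3.0, 1.0, 0.5, 2.0, 0.2]) + np.random.randn(1000) * 0.1

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        model = GradientBoostingRegressor(n_estimators=200).fit(X_tr, y_tr)
        print("R^2 on held-out routes:", model.score(X_te, y_te))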
    Using Machine Learning for Anomaly Detection on a System-on-Chip under Gamma Radiation. (arXiv:2201.01588v1 [cs.LG])
    (2 min) The emergence of new nanoscale technologies has imposed significant challenges to designing reliable electronic systems in radiation environments. A few types of radiation, like Total Ionizing Dose (TID) effects, often cause permanent damage to such nanoscale electronic devices, and current state-of-the-art technologies to tackle TID make use of expensive radiation-hardened devices. This paper focuses on a novel and different approach: using machine learning algorithms on consumer electronic level Field Programmable Gate Arrays (FPGAs) to tackle TID effects and monitor them so boards can be replaced before they stop working. A key research challenge is anticipating when a board will reach total failure due to TID effects. We observed internal measurements of the FPGA boards under gamma radiation and used three different anomaly detection machine learning (ML) algorithms to detect anomalies in the sensor measurements in a gamma-radiated environment. The statistical results show a highly significant relationship between the gamma radiation exposure levels and the board measurements. Moreover, our anomaly detection results show that a One-Class Support Vector Machine with a Radial Basis Function kernel has an average Recall score of 0.95, and all anomalies can be detected before the boards stop working.
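    The detection step itself is standard: fit a One-Class SVM with an RBF kernel on measurements from healthy boards, then flag departures as anomalies. In this sketch the feature columns and hyperparameters are placeholders for the paper's FPGA sensor data:

        import numpy as np
        from sklearn.svm import OneClassSVM

        # columns stand in for voltage/temperature sensor readings of healthy boards
        healthy = np.random.randn(500, 4)
        ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(healthy)

        new_readings = np.random.randn(10, 4)
        flags = ocsvm.predict(new_readings)   # +1 = normal, -1 = anomaly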
    Regret Lower Bounds for Learning Linear Quadratic Gaussian Systems. (arXiv:2201.01680v1 [cs.LG])
    (2 min) This paper presents local minimax regret lower bounds for adaptively controlling linear-quadratic-Gaussian (LQG) systems. We consider smoothly parametrized instances and provide an understanding of when logarithmic regret is impossible which is both instance-specific and flexible enough to take problem structure into account. This understanding relies on two key notions: that of local uninformativeness, when the optimal policy does not provide sufficient excitation for identification of the optimal policy and yields a degenerate Fisher information matrix; and that of information-regret-boundedness, when the small eigenvalues of a policy-dependent information matrix are boundable in terms of the regret of that policy. Combined with a reduction to Bayesian estimation and an application of Van Trees' inequality, these two conditions are sufficient for proving regret lower bounds of order $\sqrt{T}$ in the time horizon $T$. This method yields lower bounds that exhibit tight dimensional dependencies and scale naturally with control-theoretic problem constants. For instance, we are able to prove that systems operating near marginal stability are fundamentally hard to learn to control. We further show that large classes of systems satisfy these conditions, among them any state-feedback system with both $A$- and $B$-matrices unknown. Most importantly, we also establish that a nontrivial class of partially observable systems, essentially those that are over-actuated, satisfy these conditions, thus providing a $\sqrt{T}$ lower bound also valid for partially observable systems. Finally, we turn to two simple examples which demonstrate that our lower bound captures classical control-theoretic intuition: our lower bounds diverge for systems operating near marginal stability or with large filter gain, as these can be arbitrarily hard to (learn to) control.
    Online certification of preference-based fairness for personalized recommender systems. (arXiv:2104.14527v4 [cs.LG] UPDATED)
    (0 min) Recommender systems are facing scrutiny because of their growing impact on the opportunities we have access to. Current audits for fairness are limited to coarse-grained parity assessments at the level of sensitive groups. We propose to audit for envy-freeness, a more granular criterion aligned with individual preferences: every user should prefer their recommendations to those of other users. Since auditing for envy requires estimating the preferences of users beyond their existing recommendations, we cast the audit as a new pure exploration problem in multi-armed bandits. We propose a sample-efficient algorithm with theoretical guarantees that it does not deteriorate user experience. We also study the trade-offs achieved on real-world recommendation datasets.
    Deducing Optimal Classification Algorithm for Heterogeneous Fabric. (arXiv:2111.05558v3 [cs.LG] UPDATED)
    (0 min) Choosing the optimal machine learning algorithm for a given problem is not an easy decision. To help future researchers, we describe in this paper the optimal among the best of the algorithms. We built a synthetic data set and performed supervised machine learning runs for five different algorithms. For heterogeneous rock fabric, we identified Random Forest, among others, to be the most appropriate algorithm.
    Tracking Most Severe Arm Changes in Bandits. (arXiv:2112.13838v2 [cs.LG] UPDATED)
    (0 min) In bandits with distribution shifts, one aims to automatically detect an unknown number $L$ of changes in reward distribution, and restart exploration when necessary. While this problem remained open for many years, a recent breakthrough of Auer et al. (2018, 2019) provides the first adaptive procedure to guarantee an optimal (dynamic) regret $\sqrt{LT}$, for $T$ rounds, with no knowledge of $L$. However, not all distributional shifts are equally severe, e.g., suppose no best arm switches occur; then we cannot rule out that a regret $O(\sqrt{T})$ may remain possible. In other words, is it possible to achieve dynamic regret that optimally scales only with an unknown number of severe shifts? This unfortunately has remained elusive, despite various attempts (Auer et al., 2019, Foster et al., 2020). We resolve this problem in the case of two-armed bandits: we derive an adaptive procedure that guarantees a dynamic regret of order $\tilde{O}(\sqrt{\tilde{L} T})$, where $\tilde L \ll L$ captures an unknown number of severe best arm changes, i.e., with significant switches in rewards, which last sufficiently long to actually require a restart. As a consequence, for any number $L$ of distributional shifts outside of these severe shifts, our procedure achieves regret just $\tilde{O}(\sqrt{T})\ll \tilde{O}(\sqrt{LT})$. Finally, we note that our notion of severe shift applies in both classical settings of stochastic switching bandits and of adversarial bandits.
    Physics-informed neural network method for modelling beam-wall interactions. (arXiv:2112.11323v2 [physics.acc-ph] UPDATED)
    (2 min) A mesh-free approach for modelling beam-wall interactions in particle accelerators is proposed. The key idea of our method is to use a deep neural network as a surrogate for the solution to a set of partial differential equations involving the particle beam, and the surface impedance concept. The proposed approach is applied to the coupling impedance of an accelerator vacuum chamber with thin conductive coating, and also verified in comparison with the existing analytical formula.
    TUNet: A Block-online Bandwidth Extension Model based on Transformers and Self-supervised Pretraining. (arXiv:2110.13492v2 [cs.LG] UPDATED)
    (2 min) We introduce a block-online variant of the temporal feature-wise linear modulation (TFiLM) model to achieve bandwidth extension. The proposed architecture simplifies the UNet backbone of the TFiLM to reduce inference time and employs an efficient transformer at the bottleneck to alleviate performance degradation. We also utilize self-supervised pretraining and data augmentation to enhance the quality of bandwidth extended signals and reduce the sensitivity with respect to downsampling methods. Experiment results on the VCTK dataset show that the proposed method outperforms several recent baselines in both intrusive and non-intrusive metrics. Pretraining and filter augmentation also help stabilize and enhance the overall performance.
    Understanding Entropy Coding With Asymmetric Numeral Systems (ANS): a Statistician's Perspective. (arXiv:2201.01741v1 [stat.ML])
    (2 min) Entropy coding is the backbone of data compression. Novel machine-learning-based compression methods often use a new entropy coder called Asymmetric Numeral Systems (ANS) [Duda et al., 2015], which provides very close to optimal bitrates and simplifies [Townsend et al., 2019] advanced compression techniques such as bits-back coding. However, researchers with a background in machine learning often struggle to understand how ANS works, which prevents them from exploiting its full versatility. This paper is meant as an educational resource to make ANS more approachable by presenting it from a new perspective of latent variable models and the so-called bits-back trick. We guide the reader step by step to a complete implementation of ANS in the Python programming language, which we then generalize for more advanced use cases. We also present and empirically evaluate an open-source library of various entropy coders designed for both research and production use. Related teaching videos and problem sets are available online.
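    The core of ANS fits in a dozen lines. Below is a minimal non-streaming rANS coder over a two-symbol alphabet, in the spirit of the implementation the paper walks through; Python's arbitrary-precision integers stand in for the bit-level renormalization a production coder would add. Note the stack-like behavior: symbols are decoded in reverse order of encoding.

        # symbol frequencies summing to M, plus cumulative frequency table
        freqs = {"a": 3, "b": 1}
        M = sum(freqs.values())
        cum, c = {}, 0
        for s, f in freqs.items():
            cum[s], c = c, c + f

        def encode(symbols, x=1):
            for s in symbols:                 # push each symbol onto the integer state
                f = freqs[s]
                x = (x // f) * M + cum[s] + (x % f)
            return x

        def decode(x, n):
            out = []
            for _ in range(n):                # pop symbols in reverse (LIFO) order
                slot = x % M
                s = next(t for t in freqs if cum[t] <= slot < cum[t] + freqs[t])
                x = freqs[s] * (x // M) + slot - cum[s]
                out.append(s)
            return x, out[::-1]

        x = encode("abaa")
        _, msg = decode(x, 4)
        assert "".join(msg) == "abaa"         # round-trip recovers the input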
    Bringing Your Own View: Graph Contrastive Learning without Prefabricated Data Augmentations. (arXiv:2201.01702v1 [cs.LG])
    (0 min) Self-supervision has recently been surging at its new frontier, graph learning. It facilitates graph representations beneficial to downstream tasks, but its success could hinge on handcrafted domain knowledge or often-expensive trial and error. Even its state-of-the-art representative, graph contrastive learning (GraphCL), is not completely free of those needs, as GraphCL uses a prefabricated prior reflected by the ad-hoc manual selection of graph data augmentations. Our work aims at advancing GraphCL by answering the following questions: How to represent the space of graph augmented views? What principle can be relied upon to learn a prior in that space? And what framework can be constructed to learn the prior in tandem with contrastive learning? Accordingly, we have extended the prefabricated discrete prior in the augmentation set to a learnable continuous prior in the parameter space of graph generators, assuming that graph priors per se, similar to the concept of image manifolds, can be learned by data generation. Furthermore, to form contrastive views without collapsing to trivial solutions due to the prior learnability, we have leveraged both principles of information minimization (InfoMin) and information bottleneck (InfoBN) to regularize the learned priors. Eventually, contrastive learning, InfoMin, and InfoBN are incorporated organically into one framework of bi-level optimization. Our principled and automated approach has proven to be competitive against the state-of-the-art graph self-supervision methods, including GraphCL, on benchmarks of small graphs, and shows even better generalizability on large-scale graphs, without resorting to human expertise or downstream validation. Our code is publicly released at https://github.com/Shen-Lab/GraphCL_Automated.
    Sign Language Recognition System using TensorFlow Object Detection API. (arXiv:2201.01486v1 [cs.CV])
    (0 min) Communication is defined as the act of sharing or exchanging information, ideas or feelings. To establish communication between two people, both of them are required to have knowledge and understanding of a common language. But in the case of deaf and dumb people, the means of communication are different. Deaf is the inability to hear and dumb is the inability to speak. They communicate using sign language among themselves and with normal people, but normal people do not take seriously the importance of sign language. Not everyone possesses the knowledge and understanding of sign language, which makes communication difficult between a normal person and a deaf and dumb person. To overcome this barrier, one can build a model based on machine learning. A model can be trained to recognize different gestures of sign language and translate them into English. This will help a lot of people in communicating and conversing with deaf and dumb people. The existing Indian Sign Language Recognition systems are designed using machine learning algorithms for single- and double-handed gestures, but they are not real-time. In this paper, we propose a method to create an Indian Sign Language dataset using a webcam and then, using transfer learning, train a TensorFlow model to create a real-time Sign Language Recognition system. The system achieves a good level of accuracy even with a limited-size dataset.
    Meta Two-Sample Testing: Learning Kernels for Testing with Limited Data. (arXiv:2106.07636v2 [stat.ML] UPDATED)
    (0 min) Modern kernel-based two-sample tests have shown great success in distinguishing complex, high-dimensional distributions with appropriate learned kernels. Previous work has demonstrated that this kernel learning procedure succeeds, assuming a considerable number of observed samples from each distribution. In realistic scenarios with very limited numbers of data samples, however, it can be challenging to identify a kernel powerful enough to distinguish complex distributions. We address this issue by introducing the problem of meta two-sample testing (M2ST), which aims to exploit (abundant) auxiliary data on related tasks to find an algorithm that can quickly identify a powerful test on new target tasks. We propose two specific algorithms for this task: a generic scheme which improves over baselines and a more tailored approach which performs even better. We provide both theoretical justification and empirical evidence that our proposed meta-testing schemes outperform learning kernel-based tests directly from scarce observations, and identify when such schemes will be successful.
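    For context, the statistic underlying such kernel tests is the maximum mean discrepancy (MMD). The sketch below computes a biased MMD estimate with a fixed Gaussian kernel; the meta-learning step in the paper would instead learn the kernel (here, just the bandwidth) from auxiliary tasks.

        import numpy as np

        def mmd(X, Y, bandwidth=1.0):
            # biased MMD^2 estimate with a Gaussian (RBF) kernel
            def k(A, B):
                d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
                return np.exp(-d2 / (2 * bandwidth ** 2))
            return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

        X = np.random.randn(50, 2)          # samples from P
        Y = np.random.randn(50, 2) + 0.5    # samples from Q (shifted)
        print("MMD^2 estimate:", mmd(X, Y))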
    Online Adaptation to Label Distribution Shift. (arXiv:2107.04520v2 [cs.LG] UPDATED)
    (0 min) Machine learning models often encounter distribution shifts when deployed in the real world. In this paper, we focus on adaptation to label distribution shift in the online setting, where the test-time label distribution is continually changing and the model must dynamically adapt to it without observing the true labels. Leveraging a novel analysis, we show that the lack of true labels does not hinder estimation of the expected test loss, which enables the reduction of online label shift adaptation to conventional online learning. Informed by this observation, we propose adaptation algorithms inspired by classical online learning techniques such as Follow The Leader (FTL) and Online Gradient Descent (OGD) and derive their regret bounds. We empirically verify our findings under both simulated and real-world label distribution shifts and show that OGD is particularly effective and robust to a variety of challenging label shift scenarios.
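    The basic primitive behind label-shift adaptation can be sketched directly: rescale a fixed classifier's class probabilities by the ratio of the (estimated) test label marginal to the training marginal and renormalize. In the paper, FTL/OGD-style updates maintain the test marginal online; here it is simply given.

        import numpy as np

        def adapt(probs, q_train, q_test):
            # reweight p(y|x) by q_test(y)/q_train(y), then renormalize per row
            w = probs * (q_test / q_train)
            return w / w.sum(axis=1, keepdims=True)

        probs = np.array([[0.7, 0.3], [0.4, 0.6]])  # model outputs p(y|x)
        q_train = np.array([0.5, 0.5])              # training label marginal
        q_test = np.array([0.2, 0.8])               # shifted label marginal
        print(adapt(probs, q_train, q_test))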
    Augmenting astrophysical scaling relations with machine learning : application to reducing the SZ flux-mass scatter. (arXiv:2201.01305v1 [astro-ph.CO])
    (2 min) Complex systems (stars, supernovae, galaxies, and clusters) often exhibit low scatter relations between observable properties (e.g., luminosity, velocity dispersion, oscillation period, temperature). These scaling relations can illuminate the underlying physics and can provide observational tools for estimating masses and distances. Machine learning can provide a systematic way to search for new scaling relations (or for simple extensions to existing relations) in abstract high-dimensional parameter spaces. We use a machine learning tool called symbolic regression (SR), which models the patterns in a given dataset in the form of analytic equations. We focus on the Sunyaev-Zeldovich flux$-$cluster mass relation ($Y_\mathrm{SZ}-M$), the scatter in which affects inference of cosmological parameters from cluster abundance data. Using SR on the data from the IllustrisTNG hydrodynamical simulation, we find a new proxy for cluster mass which combines $Y_\mathrm{SZ}$ and concentration of ionized gas ($c_\mathrm{gas}$): $M \propto Y_\mathrm{conc}^{3/5} \equiv Y_\mathrm{SZ}^{3/5} (1-A\, c_\mathrm{gas})$. $Y_\mathrm{conc}$ reduces the scatter in the predicted $M$ by $\sim 20-30$% for large clusters ($M\gtrsim 10^{14}\, h^{-1} \, M_\odot$) at both high and low redshifts, as compared to using just $Y_\mathrm{SZ}$. We show that the dependence on $c_\mathrm{gas}$ is linked to cores of clusters exhibiting larger scatter than their outskirts. Finally, we test $Y_\mathrm{conc}$ on clusters from simulations of the CAMELS project and show that $Y_\mathrm{conc}$ is robust against variations in cosmology, astrophysics, subgrid physics, and cosmic variance. Our results and methodology can be useful for accurate multiwavelength cluster mass estimation from current and upcoming CMB and X-ray surveys like ACT, SO, SPT, eROSITA and CMB-S4.
    Corrupting Data to Remove Deceptive Perturbation: Using Preprocessing Method to Improve System Robustness. (arXiv:2201.01399v1 [cs.CV])
    (2 min) Although deep neural networks have achieved great performance on classification tasks, recent studies showed that well-trained networks can be fooled by adding subtle noise. This paper introduces a new approach to improve neural network robustness by applying a recovery process on top of the naturally trained classifier. In this approach, images are intentionally corrupted by a significant operator and then recovered before passing through the classifier. SARGAN, an extension of Generative Adversarial Networks (GANs), is capable of denoising radar signals. This paper shows that SARGAN can also recover corrupted images by removing the adversarial effects. Our results show that this approach does improve the performance of naturally trained networks.
    Convergence and Complexity of Stochastic Block Majorization-Minimization. (arXiv:2201.01652v1 [math.OC])
    (2 min) Stochastic majorization-minimization (SMM) is an online extension of the classical principle of majorization-minimization, which consists of sampling i.i.d. data points from a fixed data distribution and minimizing a recursively defined majorizing surrogate of an objective function. In this paper, we introduce stochastic block majorization-minimization, where the surrogates can now be only block multi-convex and a single block is optimized at a time within a diminishing radius. Relaxing the standard strong convexity requirements for surrogates in SMM, our framework gives wider applicability, including online CANDECOMP/PARAFAC (CP) dictionary learning, and yields greater computational efficiency especially when the problem dimension is large. We provide an extensive convergence analysis on the proposed algorithm, which we derive under possibly dependent data streams, relaxing the standard i.i.d. assumption on data samples. We show that the proposed algorithm converges almost surely to the set of stationary points of a nonconvex objective under constraints at a rate $O((\log n)^{1+\epsilon}/n^{1/2})$ for the empirical loss function and $O((\log n)^{1+\epsilon}/n^{1/4})$ for the expected loss function, where $n$ denotes the number of data samples processed. Under some additional assumptions, the latter convergence rate can be improved to $O((\log n)^{1+\epsilon}/n^{1/2})$. Our results provide the first convergence rate bounds for various online matrix and tensor decomposition algorithms under a general Markovian data setting.
    Synthesizing Tensor Transformations for Visual Self-attention. (arXiv:2201.01410v1 [cs.CV])
    (2 min) Self-attention shows outstanding competence in capturing long-range relationships while enhancing performance on vision tasks such as image classification and image captioning. However, the self-attention module relies heavily on the dot product multiplication and dimension alignment among query-key-value features, which causes two problems: (1) The dot product multiplication results in exhaustive and redundant computation. (2) Because the visual feature map often appears as a multi-dimensional tensor, reshaping the tensor feature to fit the dimension alignment might destroy the internal structure of the tensor feature map. To address these problems, this paper proposes a self-attention plug-in module with its variants, namely Synthesizing Tensor Transformations (STT), for directly processing image tensor features. Without computing the dot-product multiplication among query-key-value, the basic STT is composed of tensor transformations that learn synthetic attention weights from visual information. The effectiveness of the STT series is validated on image classification and image captioning. Experiments show that the proposed STT achieves competitive performance while keeping robustness compared to self-attention on the above vision tasks.
    Adaptive Online Incremental Learning for Evolving Data Streams. (arXiv:2201.01633v1 [cs.LG])
    (2 min) Recent years have witnessed growing interest in online incremental learning. However, there are three major challenges in this area. The first major difficulty is concept drift, i.e., the probability distribution in the streaming data changes as the data arrives. The second major difficulty is catastrophic forgetting, i.e., forgetting what we have learned before when learning new knowledge. The last difficulty, which is often ignored, is learning the latent representation; only a good latent representation can improve the prediction accuracy of the model. Our research builds on this observation and attempts to overcome these difficulties. To this end, we propose Adaptive Online Incremental Learning for evolving data streams (AOIL). We use an auto-encoder with a memory module: on the one hand, it yields latent features of the input; on the other hand, its reconstruction loss lets us detect concept drift, trigger the update mechanism, and adjust the model parameters in time. In addition, we divide features, which are derived from the activations of the hidden layers, into two parts, used to extract the common and private features respectively. By means of this approach, the model learns the private features of newly arriving instances without forgetting what was learned in the past (shared features), which reduces the occurrence of catastrophic forgetting. At the same time, to obtain the fused feature vector, we use a self-attention mechanism to effectively fuse the extracted features, which further improves latent representation learning.
    Sample Selection with Deadline Control for Efficient Federated Learning on Heterogeneous Clients. (arXiv:2201.01601v1 [cs.LG])
    (2 min) Federated Learning (FL) trains a machine learning model on distributed clients without exposing individual data. Unlike centralized training that is usually based on carefully-organized data, FL deals with on-device data that are often unfiltered and imbalanced. As a result, conventional FL training protocol that treats all data equally leads to a waste of local computational resources and slows down the global learning process. To this end, we propose FedBalancer, a systematic FL framework that actively selects clients' training samples. Our sample selection strategy prioritizes more "informative" data while respecting privacy and computational capabilities of clients. To better utilize the sample selection to speed up global training, we further introduce an adaptive deadline control scheme that predicts the optimal deadline for each round with varying client train data. Compared with existing FL algorithms with deadline configuration methods, our evaluation on five datasets from three different domains shows that FedBalancer improves the time-to-accuracy performance by 1.22~4.62x while improving the model accuracy by 1.0~3.3%. We also show that FedBalancer is readily applicable to other FL approaches by demonstrating that FedBalancer improves the convergence speed and accuracy when operating jointly with three different FL algorithms.
    Joint Learning-Based Stabilization of Multiple Unknown Linear Systems. (arXiv:2201.01387v1 [eess.SY])
    (2 min) Learning-based control of linear systems has received a lot of attention recently. In popular settings, the true dynamical models are unknown to the decision-maker and need to be interactively learned by applying control inputs to the systems. Unlike the mature literature on efficient reinforcement learning policies for adaptive control of a single system, results on the joint learning of multiple systems are not currently available. In particular, the important problem of fast and reliable joint stabilization remains unaddressed and is thus the focus of this work. We propose a novel joint learning-based stabilization algorithm for quickly learning stabilizing policies for all systems under study from the data of unstable state trajectories. The presented procedure is shown to be notably effective, such that it stabilizes the family of dynamical systems in an extremely short time period.
    Offsetting Unequal Competition through RL-assisted Incentive Schemes. (arXiv:2201.01450v1 [cs.GT])
    (2 min) This paper investigates the dynamics of competition among organizations with unequal expertise. Multi-agent reinforcement learning has been used to simulate and understand the impact of various incentive schemes designed to offset such inequality. We design Touch-Mark, a game based on the well-known multi-agent particle environment, where two teams (weak, strong) with unequal but changing skill levels compete against each other. For training such a game, we propose C-MADDPG, a novel controller-assisted multi-agent reinforcement learning algorithm which empowers each agent with an ensemble of policies along with a supervised controller that, by selectively partitioning the sample space, triggers intelligent role division among the teammates. Using C-MADDPG as an underlying framework, we propose an incentive scheme for the weak team such that the final rewards of both teams become the same. We find that in spite of the incentive, the final reward of the weak team falls short of the strong team. On inspection, we realize that an overall incentive scheme for the weak team does not incentivize the weaker agents within that team to learn and improve. To offset this, we instead specially incentivize the weaker player to learn, and as a result, observe that the weak team, beyond an initial phase, performs at par with the stronger team. The final goal of the paper is to formulate a dynamic incentive scheme that continuously balances the reward of the two teams. This is achieved by devising an incentive scheme enriched with an RL agent which takes minimum information from the environment.
    ZeroBERTo -- Leveraging Zero-Shot Text Classification by Topic Modeling. (arXiv:2201.01337v1 [cs.CL])
    (2 min) Traditional text classification approaches often require a good amount of labeled data, which is difficult to obtain, especially in restricted domains or less widespread languages. This lack of labeled data has led to the rise of low-resource methods, which assume low data availability in natural language processing. Among them, zero-shot learning stands out, which consists of learning a classifier without any previously labeled data. The best results reported with this approach use language models such as Transformers, but suffer from two problems: high execution time and inability to handle long texts as input. This paper proposes a new model, ZeroBERTo, which leverages an unsupervised clustering step to obtain a compressed data representation before the classification task. We show that ZeroBERTo has better performance for long inputs and shorter execution time, outperforming XLM-R by about 12% in the F1 score on the FolhaUOL dataset. Keywords: Low-Resource NLP, Unlabeled data, Zero-Shot Learning, Topic Modeling, Transformers.
    Linear Variational State Space Filtering. (arXiv:2201.01353v1 [cs.LG])
    (2 min) We introduce Variational State-Space Filters (VSSF), a new method for unsupervised learning, identification, and filtering of latent Markov state space models from raw pixels. We present a theoretically sound framework for latent state space inference under heterogeneous sensor configurations. The resulting model can integrate an arbitrary subset of the sensor measurements used during training, enabling the learning of semi-supervised state representations and enforcing that certain components of the learned latent state space agree with interpretable measurements. From this framework we derive L-VSSF, an explicit instantiation of this model with linear latent dynamics and Gaussian distribution parameterizations. We experimentally demonstrate L-VSSF's ability to filter in latent space beyond the sequence length of the training dataset across several different test environments.
    Learning Differentiable Safety-Critical Control using Control Barrier Functions for Generalization to Novel Environments. (arXiv:2201.01347v1 [eess.SY])
    (2 min) Control barrier functions (CBFs) have become a popular tool to enforce safety of a control system. CBFs are commonly utilized in a quadratic program formulation (CBF-QP) as safety-critical constraints. A class $\mathcal{K}$ function in CBFs usually needs to be tuned manually in order to balance the trade-off between performance and safety for each environment. However, this process is often heuristic and can become intractable for high relative-degree systems. Moreover, it prevents the CBF-QP from generalizing to different environments in the real world. By embedding the optimization procedure of the CBF-QP as a differentiable layer within a deep learning architecture, we propose a differentiable optimization-based safety-critical control framework that enables generalization to new environments with forward invariance guarantees. Finally, we validate the proposed control design with 2D double and quadruple integrator systems in various environments.
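    For concreteness, a CBF-QP for a 2D single integrator avoiding a disk obstacle can be written in a few lines of cvxpy. The linear class $\mathcal{K}$ term $\alpha h$ below is exactly the hand-tuned piece the paper proposes to make learnable; all constants are illustrative.

        import numpy as np
        import cvxpy as cp

        x = np.array([0.0, 0.0])                 # current state
        x_obs, r, alpha = np.array([1.0, 0.0]), 0.5, 1.0
        u_des = np.array([1.0, 0.0])             # nominal input (heads at the obstacle)

        h = np.sum((x - x_obs) ** 2) - r ** 2    # barrier: h >= 0 means safe
        grad_h = 2 * (x - x_obs)

        u = cp.Variable(2)
        prob = cp.Problem(cp.Minimize(cp.sum_squares(u - u_des)),
                          [grad_h @ u + alpha * h >= 0])  # hdot + alpha*h >= 0
        prob.solve()
        print("safe input:", u.value)            # minimally modified to respect the barrier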
    The CAMELS project: public data release. (arXiv:2201.01300v1 [astro-ph.CO])
    (2 min) The Cosmology and Astrophysics with MachinE Learning Simulations (CAMELS) project was developed to combine cosmology with astrophysics through thousands of cosmological hydrodynamic simulations and machine learning. CAMELS contains 4,233 cosmological simulations, 2,049 N-body and 2,184 state-of-the-art hydrodynamic simulations that sample a vast volume in parameter space. In this paper we present the CAMELS public data release, describing the characteristics of the CAMELS simulations and a variety of data products generated from them, including halo, subhalo, galaxy, and void catalogues, power spectra, bispectra, Lyman-$\alpha$ spectra, probability distribution functions, halo radial profiles, and X-rays photon lists. We also release over one thousand catalogues that contain billions of galaxies from CAMELS-SAM: a large collection of N-body simulations that have been combined with the Santa Cruz Semi-Analytic Model. We release all the data, comprising more than 350 terabytes and containing 143,922 snapshots, millions of halos, galaxies and summary statistics. We provide further technical details on how to access, download, read, and process the data at https://camels.readthedocs.io
    Semantic Communications: Principles and Challenges. (arXiv:2201.01389v1 [cs.IT])
    (2 min) Semantic communication, regarded as the breakthrough beyond Shannon paradigm, aims at the successful transmission of semantic information conveyed by the source rather than the accurate reception of each single symbol or bit regardless of its meaning. This article provides an overview on semantic communications. After a brief review on Shannon information theory, we discuss semantic communications with theory, frameworks, and system design enabled by deep learning. Different from the symbol/bit error rate used for measuring the conventional communication systems, new performance metrics for semantic communications are also discussed. The article is concluded by several open questions.
    Graph Decipher: A transparent dual-attention graph neural network to understand the message-passing mechanism for the node classification. (arXiv:2201.01381v1 [cs.LG])
    (2 min) Graph neural networks can be effectively applied to find solutions for many real-world problems across widely diverse fields. The success of graph neural networks is linked to the message-passing mechanism on the graph; however, the message-aggregating behavior is still not entirely clear in most algorithms. To improve functionality, we propose a new transparent network called Graph Decipher to investigate the message-passing mechanism by prioritizing two main components, the graph structure and node attributes, at the graph, feature, and global levels under the node classification task. However, the computation burden then becomes the most significant issue because the relevance of both graph structure and node attributes is computed over the whole graph. To solve this issue, only relevant representative node attributes are extracted by graph feature filters, allowing calculations to be performed in a category-oriented manner. Experiments on seven datasets show that Graph Decipher achieves state-of-the-art performance while imposing a substantially lower computation burden on the node classification task. Additionally, since our algorithm can explore the representative node attributes by category, it is utilized to alleviate the imbalanced node classification problem on multi-class graph datasets.
    Test and Evaluation of Quadrupedal Walking Gaits through Sim2Real Gap Quantification. (arXiv:2201.01323v1 [eess.SY])
    (2 min) In this letter, the authors propose a two-step approach to evaluate and verify a true system's capacity to satisfy its operational objective. Specifically, whenever the system objective has a quantifiable measure of satisfaction, e.g., a signal temporal logic specification, a barrier function, etc., the authors develop two separate optimization problems solvable via the Bayesian Optimization procedure detailed within. This dual approach has the added benefit of quantifying the Sim2Real Gap between a system simulator and its hardware counterpart. Our contributions are twofold. First, we show repeatability with respect to our outlined optimization procedure in solving these optimization problems. Second, we show that the same procedure can discriminate between different environments by identifying the Sim2Real Gap between a simulator and its hardware counterpart operating in different environments.
    Towards Understanding Quality Challenges of the Federated Learning: A First Look from the Lens of Robustness. (arXiv:2201.01409v1 [cs.LG])
    (2 min) Federated learning (FL) is a widely adopted distributed learning paradigm in practice, which intends to preserve users' data privacy while leveraging the entire dataset of all participants for training. In FL, multiple models are trained independently on the users and aggregated centrally to update a global model in an iterative process. Although this approach is excellent at preserving privacy by design, FL still tends to suffer from quality issues such as attacks or byzantine faults. Some recent attempts have been made to address such quality challenges on the robust aggregation techniques for FL. However, the effectiveness of state-of-the-art (SOTA) robust FL techniques is still unclear and lacks a comprehensive study. Therefore, to better understand the current quality status and challenges of these SOTA FL techniques in the presence of attacks and faults, in this paper, we perform a large-scale empirical study to investigate the SOTA FL's quality from multiple angles of attacks, simulated faults (via mutation operators), and aggregation (defense) methods. In particular, we perform our study on two generic image datasets and one real-world federated medical image dataset. We also systematically investigate the effect of the distribution of attacks/faults over users and the independent and identically distributed (IID) factors, per dataset, on the robustness results. After a large-scale analysis with 496 configurations, we find that most mutators on each individual user have a negligible effect on the final model. Moreover, choosing the most robust FL aggregator depends on the attacks and datasets. Finally, we illustrate that it is possible to achieve a generic solution that works almost as well or even better than any single aggregator on all attacks and configurations with a simple ensemble model of aggregators.
    Latent Vector Expansion using Autoencoder for Anomaly Detection. (arXiv:2201.01416v1 [cs.CV])
    (2 min) Deep learning methods can classify various unstructured data such as images, language, and voice. As the task of classifying anomalies becomes more important in the real world, various methods exist for classification using deep learning with data collected in the real world. Among them, one representative approach extracts and learns the main features via transfer learning from pre-trained models, and another learns an autoencoder-based structure only with normal data and classifies inputs as abnormal through a threshold value. However, if the dataset is imbalanced, even state-of-the-art models do not achieve good performance. This can be addressed by augmenting normal and abnormal features in imbalanced data as features with strong distinction. We use the features of the autoencoder to train latent vectors from low to high dimensionality. We train normal and abnormal data as features that have a strong distinction among the features of imbalanced data. We propose a latent vector expansion autoencoder model that improves classification performance on imbalanced data. The proposed method shows performance improvement over the basic autoencoder on an imbalanced anomaly dataset.
    Deep learning for location based beamforming with NLOS channels. (arXiv:2201.01386v1 [cs.IT])
(2 min) Massive MIMO systems are highly efficient but critically rely on accurate channel state information (CSI) at the base station in order to determine appropriate precoders. CSI acquisition requires sending pilot symbols, which induces a significant overhead. In this paper, a method is proposed whose objective is to determine an appropriate precoder from the knowledge of the user's location only. Such a way to determine precoders is known as location based beamforming. It makes it possible to reduce or even eliminate the need for pilot symbols, depending on how the location is obtained. The proposed method learns a direct mapping from location to precoder in a supervised way. It involves a neural network with a specific structure based on random Fourier features, allowing it to learn functions containing high spatial frequencies. It is assessed empirically and yields promising results on realistic synthetic channels. As opposed to previously proposed methods, it can handle both line-of-sight (LOS) and non-line-of-sight (NLOS) channels.
    Balsa: Learning a Query Optimizer Without Expert Demonstrations. (arXiv:2201.01441v1 [cs.DB])
    (2 min) Query optimizers are a performance-critical component in every database system. Due to their complexity, optimizers take experts months to write and years to refine. In this work, we demonstrate for the first time that learning to optimize queries without learning from an expert optimizer is both possible and efficient. We present Balsa, a query optimizer built by deep reinforcement learning. Balsa first learns basic knowledge from a simple, environment-agnostic simulator, followed by safe learning in real execution. On the Join Order Benchmark, Balsa matches the performance of two expert query optimizers, both open-source and commercial, with two hours of learning, and outperforms them by up to 2.8$\times$ in workload runtime after a few more hours. Balsa thus opens the possibility of automatically learning to optimize in future compute environments where expert-designed optimizers do not exist.
    Problem-dependent attention and effort in neural networks with an application to image resolution. (arXiv:2201.01415v1 [cs.CV])
(2 min) This paper introduces a new neural network-based estimation approach that is inspired by the biological phenomenon whereby humans and animals vary the levels of attention and effort that they dedicate to a problem depending upon its difficulty. The proposed approach leverages alternate models' internal levels of confidence in their own projections. If the least costly model is confident in its classification, then that is the classification used; if not, the model with the next lowest cost of implementation is run, and so on. This use of successively more complex models -- together with the models' internal propensity scores to evaluate their likelihood of being correct -- makes it possible to substantially reduce resource use while maintaining high standards for classification accuracy. The approach is applied to the digit recognition problem from Google's Street View House Numbers dataset, using Multilayer Perceptron (MLP) neural networks trained on high- and low-resolution versions of the digit images. The algorithm examines the low-resolution images first, only moving to higher resolution images if the classification from the initial low-resolution pass does not have a high degree of confidence. For the MLPs considered here, this sequential approach enables a reduction in resource usage of more than 50% without any sacrifice in classification accuracy.
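A minimal sketch of this cascade, assuming generic model callables and a hypothetical confidence threshold (the paper's MLPs and exact propensity scores are not reproduced here):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cascade_predict(x_low, x_high, cheap_model, costly_model, threshold=0.9):
    """Run the cheap low-resolution model first; escalate to the costly
    high-resolution model only when top-class confidence is below threshold."""
    probs = softmax(cheap_model(x_low))
    if probs.max() >= threshold:
        return int(probs.argmax()), "low-res"
    probs = softmax(costly_model(x_high))
    return int(probs.argmax()), "high-res"
```

With a reasonably calibrated cheap model, most inputs never reach the expensive pass, which is where the reported >50% resource saving comes from.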
    Bridging Adversarial and Nonstationary Multi-armed Bandit. (arXiv:2201.01628v1 [cs.LG])
    (2 min) In the multi-armed bandit framework, there are two formulations that are commonly employed to handle time-varying reward distributions: adversarial bandit and nonstationary bandit. Although their oracles, algorithms, and regret analysis differ significantly, we provide a unified formulation in this paper that smoothly bridges the two as special cases. The formulation uses an oracle that takes the best-fixed arm within time windows. Depending on the window size, it turns into the oracle in hindsight in the adversarial bandit and dynamic oracle in the nonstationary bandit. We provide algorithms that attain the optimal regret with the matching lower bound.
    Efficient-Dyn: Dynamic Graph Representation Learning via Event-based Temporal Sparse Attention Network. (arXiv:2201.01384v1 [cs.LG])
(2 min) Static graph neural networks have been widely used in modeling and representation learning of graph structure data. However, many real-world problems, such as social networks, financial transactions, and recommendation systems, are dynamic: nodes and edges are added or deleted over time. Therefore, in recent years, dynamic graph neural networks have received more and more attention from researchers. In this work, we propose a novel dynamic graph neural network, Efficient-Dyn. It adaptively encodes temporal information into a sequence of patches with an equal amount of temporal-topological structure. It thus avoids the information loss caused by the use of snapshots while achieving a finer time granularity, close to what continuous networks could provide. In addition, we design a lightweight module, Sparse Temporal Transformer, to compute node representations through both structural neighborhoods and temporal dynamics. Since the fully-connected attention conjunction is simplified, the computation cost is far lower than that of the current state of the art. Link prediction experiments are conducted on both continuous and discrete graph datasets. Through comparison with several state-of-the-art graph embedding baselines, the experimental results demonstrate that Efficient-Dyn has a faster inference speed while having competitive performance.
    Self-Supervised Approach to Addressing Zero-Shot Learning Problem. (arXiv:2201.01391v1 [cs.CV])
(2 min) In recent years, self-supervised learning has had significant success in applications involving computer vision and natural language processing. The type of pretext task is important to this boost in performance. One common pretext task is the measure of similarity and dissimilarity between pairs of images. In this scenario, the two images that make up the negative pair are visibly different to humans. However, in entomology, species are nearly indistinguishable and thus hard to differentiate. In this study, we explored the performance of a Siamese neural network using contrastive loss, learning to push apart embeddings of dissimilar bumblebee species pairs and pull together similar embeddings. Our experimental results show a 61% F1-score on zero-shot instances, an 11% improvement on samples of classes that share intersections with the training set.
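For illustration, a standard pairwise contrastive loss of the kind described might look as follows in PyTorch; this is the generic Hadsell-style formulation with assumed 0/1 pair labels, not necessarily the authors' exact loss:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, same_species, margin=1.0):
    """same_species: float tensor of 1s (similar pair) and 0s (dissimilar pair).
    Pulls similar embeddings together, pushes dissimilar ones at least
    `margin` apart."""
    d = F.pairwise_distance(emb1, emb2)
    pos = same_species * d.pow(2)                      # attract similar pairs
    neg = (1.0 - same_species) * F.relu(margin - d).pow(2)  # repel dissimilar pairs
    return (pos + neg).mean()
```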
  • stat.ML updates on arXiv.org

    Tracking Most Severe Arm Changes in Bandits. (arXiv:2112.13838v2 [cs.LG] UPDATED)
(2 min) In bandits with distribution shifts, one aims to automatically detect an unknown number $L$ of changes in reward distribution, and restart exploration when necessary. While this problem remained open for many years, a recent breakthrough of Auer et al. (2018, 2019) provides the first adaptive procedure to guarantee an optimal (dynamic) regret $\sqrt{LT}$, for $T$ rounds, with no knowledge of $L$. However, not all distributional shifts are equally severe, e.g., suppose no best arm switches occur, then we cannot rule out that a regret $O(\sqrt{T})$ may remain possible; in other words, is it possible to achieve dynamic regret that optimally scales only with an unknown number of severe shifts? This unfortunately has remained elusive, despite various attempts (Auer et al., 2019, Foster et al., 2020). We resolve this problem in the case of two-armed bandits: we derive an adaptive procedure that guarantees a dynamic regret of order $\tilde{O}(\sqrt{\tilde{L} T})$, where $\tilde L \ll L$ captures an unknown number of severe best arm changes, i.e., with significant switches in rewards, and which last sufficiently long to actually require a restart. As a consequence, for any number $L$ of distributional shifts outside of these severe shifts, our procedure achieves regret just $\tilde{O}(\sqrt{T})\ll \tilde{O}(\sqrt{LT})$. Finally, we note that our notion of severe shift applies in both classical settings of stochastic switching bandits and of adversarial bandits.
    Deep Reinforcement Learning at the Edge of the Statistical Precipice. (arXiv:2108.13264v4 [cs.LG] UPDATED)
(3 min) Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally-demanding benchmarks has led to the practice of evaluating only a small number of runs per task, exacerbating the statistical uncertainty in point estimates. In this paper, we argue that reliable evaluation in the few-run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone versus a more thorough statistical analysis. With the aim of increasing the field's confidence in results reported with a handful of runs, we advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results, as well as present more robust and efficient aggregate metrics, such as interquartile mean scores, to achieve small uncertainty in results. Using such statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used RL benchmarks including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. Our findings call for a change in how we evaluate performance in deep RL, for which we present a more rigorous evaluation methodology, accompanied by an open-source library, rliable, to prevent unreliable results from stagnating the field.
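The interquartile mean and bootstrap interval estimates advocated here are simple to compute; below is a minimal NumPy/SciPy sketch (the paper's rliable library provides stratified, multi-task versions of these tools):

```python
import numpy as np
from scipy import stats

def iqm(scores):
    """Interquartile mean: drop the lowest and highest 25% of run scores."""
    return stats.trim_mean(scores, proportiontocut=0.25)

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the IQM."""
    rng = np.random.default_rng(seed)
    boots = [iqm(rng.choice(scores, size=len(scores), replace=True))
             for _ in range(n_boot)]
    return np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])

scores = np.array([0.31, 0.45, 0.27, 0.52, 0.40])  # e.g. 5 runs of one algorithm
print(iqm(scores), bootstrap_ci(scores))
```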
    Systematic assessment of the quality of fit of the stochastic block model for empirical networks. (arXiv:2201.01658v1 [physics.soc-ph])
(2 min) We perform a systematic analysis of the quality of fit of the stochastic block model (SBM) for 275 empirical networks spanning a wide range of domains and orders of magnitude in size. We employ posterior predictive model checking as a criterion to assess the quality of fit, which involves comparing networks generated by the inferred model with the empirical network, according to a set of network descriptors. We observe that the SBM is capable of providing an accurate description for the majority of networks considered, but falls short of saturating all modeling requirements. In particular, networks possessing a large diameter and slow-mixing random walks tend to be badly described by the SBM. However, contrary to what is often assumed, networks with a high abundance of triangles can be well described by the SBM in many cases. We demonstrate that simple network descriptors can be used to evaluate whether or not the SBM can provide a sufficiently accurate representation, potentially pointing to possible model extensions that can systematically improve the expressiveness of this class of models.
    New Hard-thresholding Rules based on Data Splitting in High-dimensional Imbalanced Classification. (arXiv:2111.03306v2 [stat.ME] UPDATED)
(2 min) In binary classification, imbalance refers to situations in which one class is heavily under-represented. This issue is due to either a data collection process or because one class is indeed rare in a population. Imbalanced classification frequently arises in applications such as biology, medicine, engineering, and social sciences. In this paper, for the first time, we theoretically study the impact of imbalanced class sizes on the linear discriminant analysis (LDA) in high dimensions. We show that due to data scarcity in one class, referred to as the minority class, and the high-dimensionality of the feature space, the LDA ignores the minority class, yielding a maximum misclassification rate. We then propose a new construction of hard-thresholding rules based on a data splitting technique that reduces the large difference between the misclassification rates. We show that the proposed method is asymptotically optimal. We further study two well-known sparse versions of the LDA in imbalanced cases. We evaluate the finite-sample performance of different methods using simulations and by analyzing two real data sets. The results show that our method either outperforms its competitors or has comparable performance based on a much smaller subset of selected features, while being computationally more efficient.
    FeatureEnVi: Visual Analytics for Feature Engineering Using Stepwise Selection and Semi-Automatic Extraction Approaches. (arXiv:2103.14539v3 [cs.LG] UPDATED)
    (2 min) The machine learning (ML) life cycle involves a series of iterative steps, from the effective gathering and preparation of the data, including complex feature engineering processes, to the presentation and improvement of results, with various algorithms to choose from in every step. Feature engineering in particular can be very beneficial for ML, leading to numerous improvements such as boosting the predictive results, decreasing computational times, reducing excessive noise, and increasing the transparency behind the decisions taken during the training. Despite that, while several visual analytics tools exist to monitor and control the different stages of the ML life cycle (especially those related to data and algorithms), feature engineering support remains inadequate. In this paper, we present FeatureEnVi, a visual analytics system specifically designed to assist with the feature engineering process. Our proposed system helps users to choose the most important feature, to transform the original features into powerful alternatives, and to experiment with different feature generation combinations. Additionally, data space slicing allows users to explore the impact of features on both local and global scales. FeatureEnVi utilizes multiple automatic feature selection techniques; furthermore, it visually guides users with statistical evidence about the influence of each feature (or subsets of features). The final outcome is the extraction of heavily engineered features, evaluated by multiple validation metrics. The usefulness and applicability of FeatureEnVi are demonstrated with two use cases and a case study. We also report feedback from interviews with two ML experts and a visualization researcher who assessed the effectiveness of our system.
    Meta Two-Sample Testing: Learning Kernels for Testing with Limited Data. (arXiv:2106.07636v2 [stat.ML] UPDATED)
    (2 min) Modern kernel-based two-sample tests have shown great success in distinguishing complex, high-dimensional distributions with appropriate learned kernels. Previous work has demonstrated that this kernel learning procedure succeeds, assuming a considerable number of observed samples from each distribution. In realistic scenarios with very limited numbers of data samples, however, it can be challenging to identify a kernel powerful enough to distinguish complex distributions. We address this issue by introducing the problem of meta two-sample testing (M2ST), which aims to exploit (abundant) auxiliary data on related tasks to find an algorithm that can quickly identify a powerful test on new target tasks. We propose two specific algorithms for this task: a generic scheme which improves over baselines and a more tailored approach which performs even better. We provide both theoretical justification and empirical evidence that our proposed meta-testing schemes out-perform learning kernel-based tests directly from scarce observations, and identify when such schemes will be successful.
    Online Adaptation to Label Distribution Shift. (arXiv:2107.04520v2 [cs.LG] UPDATED)
    (2 min) Machine learning models often encounter distribution shifts when deployed in the real world. In this paper, we focus on adaptation to label distribution shift in the online setting, where the test-time label distribution is continually changing and the model must dynamically adapt to it without observing the true label. Leveraging a novel analysis, we show that the lack of true label does not hinder estimation of the expected test loss, which enables the reduction of online label shift adaptation to conventional online learning. Informed by this observation, we propose adaptation algorithms inspired by classical online learning techniques such as Follow The Leader (FTL) and Online Gradient Descent (OGD) and derive their regret bounds. We empirically verify our findings under both simulated and real world label distribution shifts and show that OGD is particularly effective and robust to a variety of challenging label shift scenarios.
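As a rough illustration of adapting without true labels, the sketch below reweights a classifier's softmax outputs by a running estimate of the test-time label marginal. This is a simple prior-correction heuristic under an assumed known training prior, not the paper's FTL/OGD algorithms or their regret analysis:

```python
import numpy as np

def adapt_to_label_shift(probs_stream, train_prior, lr=0.05):
    """probs_stream: iterable of per-example class-probability vectors.
    Maintains an online estimate q of the test label marginal and
    reweights predictions by q / train_prior (no true labels used)."""
    q = train_prior.copy()
    corrected = []
    for p in probs_stream:
        adj = p * (q / train_prior)   # tilt posterior toward estimated test prior
        adj /= adj.sum()              # renormalize to a valid distribution
        corrected.append(adj)
        q = (1 - lr) * q + lr * adj   # online update of the label marginal
    return np.array(corrected)
```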
    Algorithmic Stability and Generalization of an Unsupervised Feature Selection Algorithm. (arXiv:2010.09416v2 [cs.LG] UPDATED)
    (2 min) Feature selection, as a vital dimension reduction technique, reduces data dimension by identifying an essential subset of input features, which can facilitate interpretable insights into learning and inference processes. Algorithmic stability is a key characteristic of an algorithm regarding its sensitivity to perturbations of input samples. In this paper, we propose an innovative unsupervised feature selection algorithm attaining this stability with provable guarantees. The architecture of our algorithm consists of a feature scorer and a feature selector. The scorer trains a neural network (NN) to globally score all the features, and the selector adopts a dependent sub-NN to locally evaluate the representation abilities for selecting features. Further, we present algorithmic stability analysis and show that our algorithm has a performance guarantee via a generalization error bound. Extensive experimental results on real-world datasets demonstrate superior generalization performance of our proposed algorithm to strong baseline methods. Also, the properties revealed by our theoretical analysis and the stability of our algorithm-selected features are empirically confirmed.
    Online certification of preference-based fairness for personalized recommender systems. (arXiv:2104.14527v4 [cs.LG] UPDATED)
(2 min) Recommender systems are facing scrutiny because of their growing impact on the opportunities we have access to. Current audits for fairness are limited to coarse-grained parity assessments at the level of sensitive groups. We propose to audit for envy-freeness, a more granular criterion aligned with individual preferences: every user should prefer their recommendations to those of other users. Since auditing for envy requires estimating the preferences of users beyond their existing recommendations, we cast the audit as a new pure exploration problem in multi-armed bandits. We propose a sample-efficient algorithm with theoretical guarantees that it does not deteriorate user experience. We also study the trade-offs achieved on real-world recommendation datasets.
    The OARF Benchmark Suite: Characterization and Implications for Federated Learning Systems. (arXiv:2006.07856v3 [cs.LG] UPDATED)
(2 min) This paper presents and characterizes an Open Application Repository for Federated Learning (OARF), a benchmark suite for federated machine learning systems. Previously available benchmarks for federated learning have focused mainly on synthetic datasets and used a limited number of applications. OARF mimics more realistic application scenarios with publicly available data sets as different data silos in image, text and structured data. Our characterization shows that the benchmark suite is diverse in data size, distribution, feature distribution and learning task complexity. The extensive evaluations with reference implementations show future research opportunities for important aspects of federated learning systems. We have developed reference implementations, and evaluated important aspects of federated learning, including model accuracy, communication cost, throughput and convergence time. Through these evaluations, we discovered some interesting findings, such as that federated learning can effectively increase end-to-end throughput.
    Stein Latent Optimization for Generative Adversarial Networks. (arXiv:2106.05319v4 [cs.LG] UPDATED)
(2 min) Generative adversarial networks (GANs) with clustered latent spaces can perform conditional generation in a completely unsupervised manner. In the real world, the salient attributes of unlabeled data can be imbalanced. However, most existing unsupervised conditional GANs cannot cluster attributes of these data in their latent spaces properly because they assume uniform distributions of the attributes. To address this problem, we theoretically derive Stein latent optimization that provides reparameterizable gradient estimations of the latent distribution parameters assuming a Gaussian mixture prior in a continuous latent space. Structurally, we introduce an encoder network and a novel unsupervised conditional contrastive loss to ensure that data generated from a single mixture component represent a single attribute. We confirm that the proposed method, named Stein Latent Optimization for GANs (SLOGAN), successfully learns balanced or imbalanced attributes and achieves state-of-the-art unsupervised conditional generation performance even in the absence of attribute information (e.g., the imbalance ratio). Moreover, we demonstrate that the attributes to be learned can be manipulated using a small amount of probe data.
    Matrix Completion with Hierarchical Graph Side Information. (arXiv:2201.01728v1 [stat.ML])
    (2 min) We consider a matrix completion problem that exploits social or item similarity graphs as side information. We develop a universal, parameter-free, and computationally efficient algorithm that starts with hierarchical graph clustering and then iteratively refines estimates both on graph clustering and matrix ratings. Under a hierarchical stochastic block model that well respects practically-relevant social graphs and a low-rank rating matrix model (to be detailed), we demonstrate that our algorithm achieves the information-theoretic limit on the number of observed matrix entries (i.e., optimal sample complexity) that is derived by maximum likelihood estimation together with a lower-bound impossibility result. One consequence of this result is that exploiting the hierarchical structure of social graphs yields a substantial gain in sample complexity relative to the one that simply identifies different groups without resorting to the relational structure across them. We conduct extensive experiments both on synthetic and real-world datasets to corroborate our theoretical results as well as to demonstrate significant performance improvements over other matrix completion algorithms that leverage graph side information.
    Optimal Transport using GANs for Lineage Tracing. (arXiv:2007.12098v3 [cs.LG] UPDATED)
    (2 min) In this paper, we present Super-OT, a novel approach to computational lineage tracing that combines a supervised learning framework with optimal transport based on Generative Adversarial Networks (GANs). Unlike previous approaches to lineage tracing, Super-OT has the flexibility to integrate paired data. We benchmark Super-OT based on single-cell RNA-seq data against Waddington-OT, a popular approach for lineage tracing that also employs optimal transport. We show that Super-OT achieves gains over Waddington-OT in predicting the class outcome of cells during differentiation, since it allows the integration of additional information during training.
    Asymptotics of $\ell_2$ Regularized Network Embeddings. (arXiv:2201.01689v1 [stat.ML])
(2 min) A common approach to solving tasks, such as node classification or link prediction, on a large network begins by learning a Euclidean embedding of the nodes of the network, from which regular machine learning methods can be applied. For unsupervised random walk methods such as DeepWalk and node2vec, adding an $\ell_2$ penalty on the embedding vectors to the loss leads to improved downstream task performance. In this paper we study the effects of this regularization and prove that, under exchangeability assumptions on the graph, it asymptotically leads to learning a nuclear-norm-type penalized graphon. In particular, the exact form of the penalty depends on the choice of subsampling method used within stochastic gradient descent to learn the embeddings. We also illustrate empirically that concatenating node covariates to $\ell_2$ regularized node2vec embeddings leads to comparable, if not superior, performance to methods which incorporate node covariates and the network structure in a non-linear manner.
    Partial Separability and Functional Graphical Models for Multivariate Gaussian Processes. (arXiv:1910.03134v4 [stat.ME] UPDATED)
(2 min) The covariance structure of multivariate functional data can be highly complex, especially if the multivariate dimension is large, making extensions of statistical methods for standard multivariate data to the functional data setting challenging. For example, Gaussian graphical models have recently been extended to the setting of multivariate functional data by applying multivariate methods to the coefficients of truncated basis expansions. However, a key difficulty compared to multivariate data is that the covariance operator is compact, and thus not invertible. The methodology in this paper addresses the general problem of covariance modeling for multivariate functional data, and functional Gaussian graphical models in particular. As a first step, a new notion of separability for the covariance operator of multivariate functional data is proposed, termed partial separability, leading to a novel Karhunen-Lo\`eve-type expansion for such data. Next, the partial separability structure is shown to be particularly useful in order to provide a well-defined functional Gaussian graphical model that can be identified with a sequence of finite-dimensional graphical models, each of identical fixed dimension. This motivates a simple and efficient estimation procedure through application of the joint graphical lasso. Empirical performance of the method for graphical model estimation is assessed through simulation and analysis of functional brain connectivity during a motor task.
    Boosting Algorithms for Delivery Time Prediction in Transportation Logistics. (arXiv:2009.11598v2 [cs.LG] UPDATED)
(2 min) Travel time is a crucial measure in transportation. Accurate travel time prediction is also fundamental for operations and advanced information systems. A variety of solutions exist for short-term travel time prediction, such as solutions that utilize real-time GPS data and optimization methods to track the path of a vehicle. However, reliable long-term predictions remain challenging. We show in this paper the applicability and usefulness of travel time, i.e., delivery time, prediction for postal services. We investigate several methods, such as linear regression models and tree-based ensembles such as random forest, bagging, and boosting, to predict delivery time, conducting extensive experiments and considering many usability scenarios. Results reveal that travel time prediction can help mitigate high delays in postal services. We show that some boosting algorithms, such as light gradient boosting and CatBoost, have higher performance in terms of accuracy and runtime efficiency than other baselines such as linear regression models, the bagging regressor, and random forest.
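A minimal sketch of this kind of comparison, using scikit-learn models on synthetic data as a stand-in for the (non-public) postal records; the boosting libraries named above, LightGBM and CatBoost, would slot in the same way:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for delivery-time features (distance, load, weekday, ...).
X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (LinearRegression(),
              RandomForestRegressor(random_state=0),
              GradientBoostingRegressor(random_state=0)):
    mae = mean_absolute_error(y_te, model.fit(X_tr, y_tr).predict(X_te))
    print(f"{type(model).__name__}: MAE = {mae:.2f}")
```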
    Adversarial Feature Desensitization. (arXiv:2006.04621v3 [cs.LG] UPDATED)
    (2 min) Neural networks are known to be vulnerable to adversarial attacks -- slight but carefully constructed perturbations of the inputs which can drastically impair the network's performance. Many defense methods have been proposed for improving robustness of deep networks by training them on adversarially perturbed inputs. However, these models often remain vulnerable to new types of attacks not seen during training, and even to slightly stronger versions of previously seen attacks. In this work, we propose a novel approach to adversarial robustness, which builds upon the insights from the domain adaptation field. Our method, called Adversarial Feature Desensitization (AFD), aims at learning features that are invariant towards adversarial perturbations of the inputs. This is achieved through a game where we learn features that are both predictive and robust (insensitive to adversarial attacks), i.e. cannot be used to discriminate between natural and adversarial data. Empirical results on several benchmarks demonstrate the effectiveness of the proposed approach against a wide range of attack types and attack strengths. Our code is available at https://github.com/BashivanLab/afd.
    Regret Lower Bounds for Learning Linear Quadratic Gaussian Systems. (arXiv:2201.01680v1 [cs.LG])
    (2 min) This paper presents local minimax regret lower bounds for adaptively controlling linear-quadratic-Gaussian (LQG) systems. We consider smoothly parametrized instances and provide an understanding of when logarithmic regret is impossible which is both instance specific and flexible enough to take problem structure into account. This understanding relies on two key notions: That of local-uninformativeness; when the optimal policy does not provide sufficient excitation for identification of the optimal policy, and yields a degenerate Fisher information matrix; and that of information-regret-boundedness, when the small eigenvalues of a policy-dependent information matrix are boundable in terms of the regret of that policy. Combined with a reduction to Bayesian estimation and application of Van Trees' inequality, these two conditions are sufficient for proving regret bounds on order of magnitude $\sqrt{T}$ in the time horizon, $T$. This method yields lower bounds that exhibit tight dimensional dependencies and scale naturally with control-theoretic problem constants. For instance, we are able to prove that systems operating near marginal stability are fundamentally hard to learn to control. We further show that large classes of systems satisfy these conditions, among them any state-feedback system with both $A$- and $B$-matrices unknown. Most importantly, we also establish that a nontrivial class of partially observable systems, essentially those that are over-actuated, satisfy these conditions, thus providing a $\sqrt{T}$ lower bound also valid for partially observable systems. Finally, we turn to two simple examples which demonstrate that our lower bound captures classical control-theoretic intuition: our lower bounds diverge for systems operating near marginal stability or with large filter gain -- these can be arbitrarily hard to (learn to) control.
    Understanding Entropy Coding With Asymmetric Numeral Systems (ANS): a Statistician's Perspective. (arXiv:2201.01741v1 [stat.ML])
(2 min) Entropy coding is the backbone of data compression. Novel machine-learning based compression methods often use a new entropy coder called Asymmetric Numeral Systems (ANS) [Duda et al., 2015], which provides very close to optimal bitrates and simplifies [Townsend et al., 2019] advanced compression techniques such as bits-back coding. However, researchers with a background in machine learning often struggle to understand how ANS works, which prevents them from exploiting its full versatility. This paper is meant as an educational resource to make ANS more approachable by presenting it from a new perspective of latent variable models and the so-called bits-back trick. We guide the reader step by step to a complete implementation of ANS in the Python programming language, which we then generalize for more advanced use cases. We also present and empirically evaluate an open-source library of various entropy coders designed for both research and production use. Related teaching videos and problem sets are available online.
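The core of ANS is compact enough to sketch directly. Below is a toy range-ANS (rANS) coder that leans on Python's arbitrary-precision integers; a production coder, like those in the paper's library, renormalizes the state to a fixed word size and streams out bits:

```python
# Toy rANS over a two-symbol alphabet with frequencies 3:1.
freqs = {"a": 3, "b": 1}
total = sum(freqs.values())
cum, c = {}, 0
for s, f in freqs.items():          # build the cumulative frequency table
    cum[s] = c
    c += f

def encode(symbols, state=1):
    for s in symbols:
        f = freqs[s]
        state = (state // f) * total + cum[s] + state % f
    return state

def decode(state, n):
    out = []
    for _ in range(n):
        slot = state % total        # which frequency band the state falls in
        s = next(t for t in freqs if cum[t] <= slot < cum[t] + freqs[t])
        out.append(s)
        state = freqs[s] * (state // total) + slot - cum[s]
    return "".join(reversed(out))   # rANS decodes in LIFO order

state = encode("aababa")
assert decode(state, 6) == "aababa"
```

Frequent symbols grow the state less per step, which is how the near-optimal bitrate arises.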
    Inverse Extended Kalman Filter. (arXiv:2201.01539v1 [math.OC])
    (2 min) Recent advances in counter-adversarial systems have garnered significant research interest in inverse filtering from a Bayesian perspective. For example, interest in estimating the adversary's Kalman filter tracked estimate with the purpose of predicting the adversary's future steps has led to recent formulations of inverse Kalman filter (I-KF). In this context of inverse filtering, we address the key challenges of nonlinear process dynamics and unknown input to the forward filter by proposing inverse extended Kalman filter (I-EKF). We derive I-EKF with and without an unknown input by considering nonlinearity in both forward and inverse state-space models. In the process, I-KF-with-unknown-input is also obtained. We then provide theoretical stability guarantees using both bounded nonlinearity and unknown matrix approaches. We further generalize these formulations and results to the case of higher-order, Gaussian-sum, and dithered I-EKFs. Numerical experiments validate our methods for various proposed inverse filters using the recursive Cram\'er-Rao lower bound as a benchmark.
    Convergence and Complexity of Stochastic Block Majorization-Minimization. (arXiv:2201.01652v1 [math.OC])
(2 min) Stochastic majorization-minimization (SMM) is an online extension of the classical principle of majorization-minimization, which consists of sampling i.i.d. data points from a fixed data distribution and minimizing a recursively defined majorizing surrogate of an objective function. In this paper, we introduce stochastic block majorization-minimization, where the surrogates can now be only block multi-convex and a single block is optimized at a time within a diminishing radius. Relaxing the standard strong convexity requirements for surrogates in SMM, our framework gives wider applicability including online CANDECOMP/PARAFAC (CP) dictionary learning and yields greater computational efficiency especially when the problem dimension is large. We provide an extensive convergence analysis on the proposed algorithm, which we derive under possibly dependent data streams, relaxing the standard i.i.d. assumption on data samples. We show that the proposed algorithm converges almost surely to the set of stationary points of a nonconvex objective under constraints at a rate $O((\log n)^{1+\epsilon}/n^{1/2})$ for the empirical loss function and $O((\log n)^{1+\epsilon}/n^{1/4})$ for the expected loss function, where $n$ denotes the number of data samples processed. Under some additional assumption, the latter convergence rate can be improved to $O((\log n)^{1+\epsilon}/n^{1/2})$. Our results provide the first convergence rate bounds for various online matrix and tensor decomposition algorithms under a general Markovian data setting.
  • The Official NVIDIA Blog

    Scooping up Customers: Startup’s No-Code AI Gains Traction for Industrial Inspection
(3 min) Bill Kish founded Ruckus Wireless two decades ago to make Wi-Fi networking easier. Now, he’s doing the same for computer vision in industrial AI. In 2015, Kish started Cogniac, a company that offers a self-service computer vision platform and development support. Like in the early days of Wi-Fi deployment, the rollout of AI is challenging.
  • AI Weirdness

    Remember those wooden playgrounds? AI doesn't.
    (3 min) I remember when my hometown got one of these giant wooden playgrounds. It must have been in the early 90s and a kid could get lost for hours in there. Or injured or full of splinters or chromated copper arsenate I guess, which is why you don't see
    Bonus: these playgrounds might not be exactly safe
    (1 min) AI Weirdness: the strange side of machine learning
  • AWS Machine Learning Blog

    Blur faces in videos automatically with Amazon Rekognition Video
    (8 min) With the advent of artificial intelligence (AI) and machine learning (ML), customers and the general public have become increasingly aware of their privacy, as well as the value that it holds in today’s data-driven world. Enterprises are actively seeking out and marketing privacy-first solutions, especially in the Computer Vision (CV) domain. They need to reassure […]
  • Artificial Intelligence

    Does anyone know what ai software may have been used to make this? 👌🌸🥰
    (0 min) submitted by /u/6owline1vex [link] [comments]
    This Is The Reason Why Bruce Willis Is Licensing Deepfake Image Rights
    (0 min) submitted by /u/sopadebombillas [link] [comments]
    Record Chess Moves with Webcam for Online Play
    (0 min) Hello! I would like to share with you my project which records moves made on your real life chess board using webcam. It also transmits the moves to any chess website for playing online. https://github.com/karayaman/Play-online-chess-with-real-chess-board submitted by /u/Savings_Ad904 [link] [comments]
    [R] Baidu’s 10-Billion Scale ERNIE-ViLG Unified Generative Pretraining Framework Achieves SOTA Performance on Bidirectional Vision-Language Generation Tasks
    (1 min) Baidu researchers propose ERNIE-ViLG, a 10-billion parameter scale pretraining framework for bidirectional text-image generation. Pretrained on 145 million (Chinese) image-text pairs, ERNIE-ViLG achieves state-of-the-art performance on both text-to-image and image-to-text generation tasks. Here is a quick read: Baidu’s 10-Billion Scale ERNIE-ViLG Unified Generative Pretraining Framework Achieves SOTA Performance on Bidirectional Vision-Language Generation Tasks. The paper ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation is on arXiv. submitted by /u/Yuqing7 [link] [comments]
Development & Application Of Responsible AI For Allied Defense & Security - Dr. Nikos Loutas Ph.D., Head of Data & Artificial Intelligence (AI) Policy, NATO
    (1 min) submitted by /u/ObjectiveGround5 [link] [comments]
    What are the biggest AI events in Europe?
    (1 min) I want to start attending AI trade shows and conferences in Europe, but there seem to be many and I don't know where to start. I'm particularly interested in applying to be a speaker so I can share my thesis insights with peers. Thanks in advance! submitted by /u/kerfufflewhoople [link] [comments]
    Upscale images
(1 min) Hey, what is the best way to upscale lots of 500x500 PNG images? Any website/program/colab recommendations? Thanks for the help. submitted by /u/BananaDrum [link] [comments]
    I hope you find this video entertaining or informative! Artificial Intelligence: Life is Weird
    (0 min) submitted by /u/TonyTalksBackPodcast [link] [comments]
  • Machine Learning

    Neural Architecture Search (NAS) [D]
(2 min) The goal of NAS is to find an optimal architecture of a neural network for a given problem. In my case, I am interested in creating an algorithm for finding the architecture of a Convolutional Neural Network for an image dataset. The problem with any NAS algorithm is that the only true way to know if a created architecture is good is by running the model to convergence, thus increasing the computational resources spent. There have been many proposed solutions for gauging the relative trained accuracy of a model. The most popular is to simply reduce the dataset, reuse training weights, and/or train for only a couple of iterations. I was wanting to hear some thoughts... I can save computational resources by only training each possible model for 3 iterations and gauging its success using its training accuracy. The problem with using training accuracy, however, is that it does not gauge how well the model will generalize, only whether it is capable of learning anything from the training data, not whether it is overfitting. The problem with using the validation accuracy after 3 iterations is that most models have extremely poor validation accuracy scores early on during training, only catching up around iteration 10; therefore, in order to fully gauge the validation accuracy, the models would have to be trained until iteration 10, 3x more than using the training accuracy after 3 iterations. Would you rather train the models for only 3 iterations, using training accuracy as the metric to compare models, risking the models overfitting when trained until convergence; or would you rather play it safe and train models for 10 iterations using validation accuracy as a metric to compare models, but spend 3x the amount of computational resources? submitted by /u/MyActualUserName99 [link] [comments]
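A sketch of the trade-off being asked about, with hypothetical build/train_fn/metric_fn helpers; the whole question is which metric and epoch budget to pass:

```python
def rank_candidates(candidates, train_fn, metric_fn, budget_epochs=3):
    """candidates: iterable of model constructors.
    train_fn(model, epochs): trains the model in place for a few epochs.
    metric_fn(model): cheap proxy score, e.g. training accuracy (fast but
    blind to overfitting) or validation accuracy (needs ~10 epochs here).
    Returns (score, constructor) pairs sorted best-first by the proxy."""
    scored = []
    for build in candidates:
        model = build()
        train_fn(model, epochs=budget_epochs)
        scored.append((metric_fn(model), build))
    return sorted(scored, key=lambda t: t[0], reverse=True)
```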
    [R] Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
    (0 min) submitted by /u/hardmaru [link] [comments]
    [D] Is model calibration important for industry practitioners?
    (2 min) There has been a lot of recent work on model calibration — ensuring a match between a model’s confidence and correctness. This seems critical in applications where confidence/safety matters (e.g. perception related parts of self-driving), but I’m curious whether model calibration is something important for industry practitioners. submitted by /u/Zestyclose-Orange468 [link] [comments]
    [P] ML Contests: Competitive ML Community Website + Discord
(1 min) I posted here a while ago about a website I created that aggregates competitions across Kaggle and other websites: https://mlcontests.com (open source and community maintained - GitHub link at the bottom of the page). Based on some feedback I implemented a few extra features, and today I launched a Discord community to go along with it - this allows people to find teammates to collaborate with on a competition, and to get real-time alerts when new competitions are launched. There are also a few team members from the competition platforms in the Discord already. Also welcome is discussion about ongoing competitions, and post-mortems/debriefs after competitions end. If this sounds interesting to you, you can join the Discord here: https://discord.com/invite/nM482gc9h8 Hope to see you there! submitted by /u/hcarlens [link] [comments]
    [D] MC Dropout is not Bayesian, so why are so many papers still using it for that?
    (3 min) In the domain of uncertainty estimation in places like segmentation, MC Dropout is used as an approximation of Bayesian computations. I've seen a thread from a few years ago talking about its issues and I've just read this recent paper called Is MC Dropout Bayesian?, and they conclude it isn't. They propose a new method for parametric VI based on the reparametrization trick and stochastic backpropagation, but my maths isn't good enough to fully understand it. Does anyone have any thoughts on this? I find it a little strange that every uncertainty estimation paper in segmentation just uses MC Dropout as a default, with just a throwaway line that it is "approximately Bayesian", even though it isn't! submitted by /u/shellyturnwarm [link] [comments]
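For reference, the MC Dropout recipe the post questions is just dropout left active at test time, averaged over stochastic forward passes; a minimal PyTorch sketch:

```python
import torch

def mc_dropout_predict(model, x, n_samples=20):
    """Average n_samples stochastic forward passes; the per-class std is
    the (contested) uncertainty proxy discussed above."""
    model.train()   # enables dropout; beware this also switches batchnorm mode
    with torch.no_grad():
        preds = torch.stack([model(x).softmax(dim=-1)
                             for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)
```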
    How to Implement an Efficient LayerNorm CUDA Kernel[R]
(1 min) Article: https://oneflow2020.medium.com/how-to-implement-an-efficient-layernorm-cuda-kernel-oneflow-performance-optimization-731e91a285b8 Code: https://github.com/Oneflow-Inc/oneflow/ In a previous article, we discussed OneFlow’s techniques for optimizing the Softmax CUDA Kernel. The performance of the OneFlow-optimized Softmax greatly exceeds that of the Softmax of CuDNN, and OneFlow also fully optimizes half types that many frameworks do not take into account. Here, we share OneFlow’s approach for optimizing the performance of another important operator, LayerNorm. submitted by /u/Just0by [link] [comments]
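For readers unfamiliar with the operator itself, here is an unoptimized NumPy reference of LayerNorm; the article is about making exactly this per-row reduction fast in CUDA:

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    """Normalize over the last axis, then apply a learned affine transform."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)   # biased variance, as in standard LN
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```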
    [D] Neuro-Symbolic AI
(1 min) Hello, I recently came across neuro-symbolic AI, and the concept looked really cool. However, except for 2-3 papers from IBM-MIT research and a couple of blogs, I couldn’t find any work in this direction. Apart from VQA, what kinds of domains could NS potentially be applied to? Are there any caveats I am not aware of as to why not many in the community are working in this direction? Thank you! submitted by /u/NightlessBaron [link] [comments]
    [D] GAN training initially degrades results of pre-trained generator
(3 min) Hello, I have an issue with the training of a GAN, which consists of a generator and two discriminators, each discriminator working on one specific domain (i.e. time domain and time-frequency domain). The generator is used to generate waveforms, given other waveforms (i.e. noisy waveform to de-noised waveform). The whole GAN is trained in the following way: 1) the generator is independently pre-trained by regression, until validation losses reach a plateau; 2) the two randomly initialized discriminators are then activated, and GAN training takes place. The output of step 1 is a raw version of the desired output. However, step 2 initially degrades the output of the generator, by removing too much information. This is also reflected in all the validation losses, whose values…
    [D]DATACAMP & Data Leakage Practices. Does Datacamp need a quality audit?
(2 min) I must be going crazy. Anyone take Datacamp courses and do the projects? Quite a lot of projects and courses actually apply data preprocessing techniques before splitting the data, which encourages data leakage. I can't believe I paid for their services... Anyone catch other bad habits that Datacamp encourages, aside from it being a rather ineffective platform to learn on? submitted by /u/THE_REAL_ODB [link] [comments]
    [D] How to choose the appropriate network: Pretrained Image Recognition Models
(2 min) Hi! I am currently involved in a project that needs me to classify multimodal data, in which one modality is image. I went through many papers on the topic and most of them just seem to choose an arbitrary pretrained architecture, VGG19 being the most common. I found little to no justification for the choice except for the performance being better than that of other models. So basically, I want to know whether there is a principled way to choose the right model. And if not, which one out of Inception, VGG and ResNet is the best for binary image classification on a large scale? Thanks! submitted by /u/prabhav55221 [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Seq2Seq learning resources
(0 min) I am interested in learning the Seq2Seq family of ML/DL models. Can you provide free learning resources covering topics like RNN, LSTM, GRU, and Transformer? Thanks. submitted by /u/grid_world [link] [comments]
Responsive AI dialogue for games with NN?
(1 min) Hello, I am currently learning NN algorithms because I want to create a project, but along the way I got an idea for another project, and I'm curious if it's possible. Some of us have our favourite games of all time, as do I, e.g. Skyrim. And so I got the second idea for the project. What if I created some kind of a "chat bot" that runs in the background in real time? It would allow the player to communicate with NPCs with their own voice and ask them something that is not scripted in the game itself, but is "lore friendly"; by this I mean no questions about Trump, Biden or TikTok lol. Of course the second step would be to synthesize the voices of the NPCs for their generated dialogue. What I'd like to know is whether that is possible, not only from a programming perspective (since anything, given enough time, could be completed), but whether it is possible for modern computers to run. It is just a crazy idea from a beginner lol. submitted by /u/skollehatti [link] [comments]
    Are there any good books or videos for beginners?
(1 min) I’m only 15 and I don’t know much about how neural networks actually function. I have been very interested in this subject for a while and I wanted to learn more. Any recommendations? submitted by /u/Forever061 [link] [comments]
  • Reinforcement Learning

    Reinforcement learning for the real world
    (0 min) submitted by /u/bendee983 [link] [comments]
    Can't find where the agent is in the code from this paper (I am a newbie)
(2 min) Hi all, I am trying to look at the code from a paper (https://github.com/anirudh9119/shared_workspace/tree/main/Triangle). I can code fluently and I know RL quite well in theory; the problem is I don't have much hands-on experience yet. Perhaps this is why I get confused when looking at this repo: not because I don't understand how the algorithm works, but because I can't find the code for it. Specifically, I would like to take the code for this algorithm (screenshot of the algorithm from the paper omitted). Naively, I would expect to see one file with all the classes and functions needed for this agent, and then another file with the agent imported and the reinforcement learning task performed in a certain environment. Instead, in this repo there are so many files with so many classes and so many functions that I get lost and can't identify which code refers to this algorithm. I can't see where the agent is, I can't see where the task is performed, and so on. I don't understand if the repo is actually messy, or if it's just that I'm not interpreting it in the right way. Thanks! submitted by /u/No_Possibility_7588 [link] [comments]
    What is the current SOTA for Offline RL?
    (1 min) Hi everyone! I'm mostly interested in Offline RL approaches for environments with distribution shift. I'm reading Decision Transformer: Reinforcement Learning via Sequence Modeling (https://arxiv.org/abs/2106.01345) paper, and was wondering what would be the benchmark / SOTA right now? submitted by /u/fusionquant [link] [comments]
    Grab your digital copy of The TensorFlow Workshop
(1 min) Packt has published "The TensorFlow Workshop". Grab your digital copy now if you are interested. As part of our marketing activities, we are offering free digital copies of the book in return for unbiased feedback in the form of a reader review. Get started with TensorFlow fundamentals to build and train deep learning models with real-world data, practical exercises, and challenging activities. Here is what you will learn from the book: get to grips with TensorFlow’s mathematical operations; pre-process a wide variety of tabular, sequential, and image data; understand the purpose and usage of different deep learning layers; perform hyperparameter tuning to prevent overfitting of training data; use pre-trained models to speed up the development of learning models; generate new data based on existing patterns using generative models. Key features: understand the fundamentals of tensors, neural networks, and deep learning; discover how to implement and fine-tune deep learning models for real-world datasets; build your experience and confidence with hands-on exercises and activities. Please comment below or DM me for more details. submitted by /u/RoyluisRodrigues [link] [comments]
    How to draw the path of the end effector in mujoco_py
(1 min) I’m working on a project where my robotic arm should trace a handwritten letter. To show that the arm is indeed tracing out the letter, it would be better if I could draw a line as the end effector moves. Does anyone know how I can do that? submitted by /u/the_loner_98 [link] [comments]
    "Hidden Agenda: a Social Deduction Game with Diverse Learned Equilibria", Kopparapu et al 2022 {DM} (sus)
    (1 min) submitted by /u/gwern [link] [comments]
    Defining actor/critic networks under one class
(1 min) Hello everyone, I am currently using a PPO implementation (using PyTorch) that has the actor and critic networks defined under one class (class ActorCritic(nn.Module)), where each network is defined using nn.Sequential (self.actor = nn.Sequential(...) and self.critic = nn.Sequential(...)). Is there a way to define both networks in the same class but using the forward method instead of nn.Sequential? submitted by /u/AhmedNizam_ [link] [comments]
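One way to do it: keep both heads in a single nn.Module, but write the layers out explicitly and compose them in forward(). Layer sizes here are illustrative, not taken from the poster's implementation:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.pi_fc1 = nn.Linear(obs_dim, hidden)  # actor layers
        self.pi_fc2 = nn.Linear(hidden, act_dim)
        self.v_fc1 = nn.Linear(obs_dim, hidden)   # critic layers
        self.v_fc2 = nn.Linear(hidden, 1)

    def forward(self, obs):
        logits = self.pi_fc2(torch.tanh(self.pi_fc1(obs)))  # action logits
        value = self.v_fc2(torch.tanh(self.v_fc1(obs)))     # state-value estimate
        return logits, value
```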

2022-01-06

  • cs.LG updates on arXiv.org

    Hierarchical Learning to Solve Partial Differential Equations Using Physics-Informed Neural Networks. (arXiv:2112.01254v2 [cs.LG] UPDATED)
    (2 min) The neural network-based approach to solving partial differential equations has attracted considerable attention due to its simplicity and flexibility in representing the solution of the partial differential equation. In training a neural network, the network learns global features corresponding to low-frequency components while high-frequency components are approximated at a much slower rate. For a class of equations in which the solution contains a wide range of scales, the network training process can suffer from slow convergence and low accuracy due to its inability to capture the high-frequency components. In this work, we propose a hierarchical approach to improve the convergence rate and accuracy of the neural network solution to partial differential equations. The proposed method comprises multi-training levels in which a newly introduced neural network is guided to learn the residual of the previous level approximation. By the nature of neural networks' training process, the high-level correction is inclined to capture the high-frequency components. We validate the efficiency and robustness of the proposed hierarchical approach through a suite of linear and nonlinear partial differential equations.
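A schematic of the multi-level idea, with hypothetical make_net/train_net helpers: each newly introduced network is trained on the residual left by the sum of the previous levels.

```python
def hierarchical_fit(n_levels, make_net, train_net, xs, target):
    """make_net(): returns a fresh (callable) network.
    train_net(net, xs, residual): fits net to the current residual.
    Returns the list of trained levels; their sum approximates target."""
    nets, approx = [], 0.0
    for _ in range(n_levels):
        net = make_net()
        train_net(net, xs, target - approx)  # learn what is still missing
        nets.append(net)
        approx = approx + net(xs)            # later levels pick up high frequencies
    return nets
```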
    Towards Deeper Deep Reinforcement Learning with Spectral Normalization. (arXiv:2106.01151v4 [cs.LG] UPDATED)
    (2 min) In computer vision and natural language processing, innovations in model architecture that increase model capacity have reliably translated into gains in performance. In stark contrast with this trend, state-of-the-art reinforcement learning (RL) algorithms often use small MLPs, and gains in performance typically originate from algorithmic innovations. It is natural to hypothesize that small datasets in RL necessitate simple models to avoid overfitting; however, this hypothesis is untested. In this paper we investigate how RL agents are affected by exchanging the small MLPs with larger modern networks with skip connections and normalization, focusing specifically on actor-critic algorithms. We empirically verify that naively adopting such architectures leads to instabilities and poor performance, likely contributing to the popularity of simple models in practice. However, we show that dataset size is not the limiting factor, and instead argue that instability from taking gradients through the critic is the culprit. We demonstrate that spectral normalization (SN) can mitigate this issue and enable stable training with large modern architectures. After smoothing with SN, larger models yield significant performance improvements -- suggesting that more "easy" gains may be had by focusing on model architectures in addition to algorithmic innovations.
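PyTorch ships spectral normalization as a module wrapper, so applying this fix to a critic is roughly a one-line change per layer; a minimal sketch with illustrative sizes:

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

obs_dim, hidden = 17, 256  # e.g. a continuous-control observation size

critic = nn.Sequential(
    spectral_norm(nn.Linear(obs_dim, hidden)),  # constrain the layer's spectral norm
    nn.ReLU(),
    spectral_norm(nn.Linear(hidden, hidden)),
    nn.ReLU(),
    nn.Linear(hidden, 1),                       # output head left unconstrained
)
```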
    Adaptive and Multiple Time-scale Eligibility Traces for Online Deep Reinforcement Learning. (arXiv:2008.10040v2 [cs.RO] UPDATED)
    (2 min) Deep reinforcement learning (DRL) is one promising approach to teaching robots to perform complex tasks. Because methods that directly reuse stored experience data cannot follow changes in the environment in robotic problems with time-varying environments, online DRL is required. The eligibility traces method is well known as an online learning technique for improving sample efficiency in traditional reinforcement learning with linear regressors, rather than in DRL. The dependency between the parameters of deep neural networks would destroy the eligibility traces, which is why they have not been integrated with DRL. Although replacing the accumulated gradients with the single most influential one can alleviate this problem, the replacement reduces the number of times previous experiences are reused. To address these issues, this study proposes a new eligibility traces method that can be used even in DRL while maintaining high sample efficiency. When the accumulated gradients differ from those computed using the latest parameters, the proposed method adaptively decays the eligibility traces according to the divergence between the past and latest parameters. Because directly computing that parameter divergence is computationally infeasible, Bregman divergences between the outputs computed with the past and latest parameters are used instead. In addition, a generalized method with multiple time-scale traces is designed for the first time. This design allows for the replacement of the most influential adaptively accumulated (decayed) eligibility traces.
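    For reference, the classical accumulating trace that this work generalizes is (notation assumed: value function v_theta, discount gamma, trace decay lambda, TD error delta_t, step size alpha):

        e_t = \gamma \lambda \, e_{t-1} + \nabla_\theta v_\theta(s_t),
        \qquad
        \theta_{t+1} = \theta_t + \alpha \, \delta_t \, e_t

    The proposed method, roughly speaking, shrinks the effective decay of e_t as the (Bregman-divergence-measured) gap between past and latest parameters grows.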
    Dynamic Object Removal and Spatio-Temporal RGB-D Inpainting via Geometry-Aware Adversarial Learning. (arXiv:2008.05058v4 [cs.CV] UPDATED)
    (2 min) Dynamic objects have a significant impact on the robot's perception of the environment which degrades the performance of essential tasks such as localization and mapping. In this work, we address this problem by synthesizing plausible color, texture and geometry in regions occluded by dynamic objects. We propose the novel geometry-aware DynaFill architecture that follows a coarse-to-fine topology and incorporates our gated recurrent feedback mechanism to adaptively fuse information from previous timesteps. We optimize our architecture using adversarial training to synthesize fine realistic textures which enables it to hallucinate color and depth structure in occluded regions online in a spatially and temporally coherent manner, without relying on future frame information. Casting our inpainting problem as an image-to-image translation task, our model also corrects regions correlated with the presence of dynamic objects in the scene, such as shadows or reflections. We introduce a large-scale hyperrealistic dataset with RGB-D images, semantic segmentation labels, camera poses as well as groundtruth RGB-D information of occluded regions. Extensive quantitative and qualitative evaluations show that our approach achieves state-of-the-art performance, even in challenging weather conditions. Furthermore, we present results for retrieval-based visual localization with the synthesized images that demonstrate the utility of our approach.
    Interpretable Low-Resource Legal Decision Making. (arXiv:2201.01164v1 [cs.LG])
    (2 min) Over the past several years, legal applications of deep learning have been on the rise. However, as with other high-stakes decision making areas, the requirement for interpretability is of crucial importance. Current models utilized by legal practitioners are mostly of the conventional machine learning type, which are inherently interpretable yet unable to harness the performance capabilities of data-driven deep learning models. In this work, we utilize deep learning models in the area of trademark law to shed light on the issue of likelihood of confusion between trademarks. Specifically, we introduce a model-agnostic interpretable intermediate layer, a technique which proves to be effective for legal documents. Furthermore, we utilize weakly supervised learning by means of a curriculum learning strategy, effectively demonstrating the improved performance of a deep learning model. This is in contrast to the conventional models, which are only able to utilize the limited number of expensive samples manually annotated by legal experts. Although the methods presented in this work tackle the task of risk of confusion for trademarks, it is straightforward to extend them to other fields of law, or more generally, to other similar high-stakes application scenarios.
    Trusting Machine Learning Results from Medical Procedures in the Operating Room. (arXiv:2201.01060v1 [cs.LG])
    (2 min) Machine learning can be used to analyse physiological data for several purposes. Detection of cerebral ischemia is an achievement that would have high impact on patient care. We studied whether collection of continuous physiological data from non-invasive monitors, combined with machine learning analysis, could detect cerebral ischemia in two different settings: during surgery for carotid endarterectomy (CEA) and during endovascular thrombectomy in acute stroke. We compare the results from the two groups and examine one patient from each group in detail. While results from CEA patients are consistent, those from thrombectomy patients are not and frequently contain extreme values such as an accuracy of 1.0. We conclude that this results from the short duration of the procedure and an abundance of poor-quality data, yielding small data sets. These results can therefore not be trusted.
    Neural Transfer Learning for Repairing Security Vulnerabilities in C Code. (arXiv:2104.08308v3 [cs.SE] UPDATED)
    (2 min) In this paper, we address the problem of automatic repair of software vulnerabilities with deep learning. The major problem with data-driven vulnerability repair is that the few existing datasets of known confirmed vulnerabilities consist of only a few thousand examples. However, training a deep learning model often requires hundreds of thousands of examples. In this work, we leverage the intuition that the bug fixing task and the vulnerability fixing task are related and that the knowledge learned from bug fixes can be transferred to fixing vulnerabilities. In the machine learning community, this technique is called transfer learning. In this paper, we propose an approach for repairing security vulnerabilities named VRepair which is based on transfer learning. VRepair is first trained on a large bug fix corpus and is then tuned on a vulnerability fix dataset, which is an order of magnitude smaller. In our experiments, we show that a model trained only on a bug fix corpus can already fix some vulnerabilities. Then, we demonstrate that transfer learning improves the ability to repair vulnerable C functions. We also show that the transfer learning model performs better than a model trained with a denoising task and fine-tuned on the vulnerability fixing task. To sum up, this paper shows that transfer learning works well for repairing security vulnerabilities in C compared to learning on a small dataset.
    Differentially Private Transferrable Deep Learning with Membership-Mappings. (arXiv:2105.04615v3 [cs.LG] UPDATED)
    (2 min) This paper considers the problem of differentially private semi-supervised transfer and multi-task learning. The notion of membership-mapping has been developed on a measure-theoretic basis to learn data representation via a fuzzy membership function. An alternative conception of the deep autoencoder, referred to as the Conditionally Deep Membership-Mapping Autoencoder (CDMMA), is considered for transferrable deep learning. Under practice-oriented settings, an analytical solution for the learning of CDMMA can be derived by means of variational optimization. The paper proposes a transfer and multi-task learning approach that combines CDMMA with a tailored noise-adding mechanism to achieve a given privacy-loss bound with the minimum perturbation of the data. Numerous experiments were carried out using the MNIST, USPS, Office, and Caltech256 datasets to verify the competitive robust performance of the proposed methodology.
    On the Minimal Adversarial Perturbation for Deep Neural Networks with Provable Estimation Error. (arXiv:2201.01235v1 [cs.LG])
    (2 min) Although Deep Neural Networks (DNNs) have shown incredible performance in perceptive and control tasks, several trustworthiness issues remain open. One of the most discussed topics is the existence of adversarial perturbations, which has opened an interesting research line on provable techniques capable of quantifying the robustness of a given input. In this regard, the Euclidean distance of the input from the classification boundary denotes a well-proved robustness assessment as the minimal affordable adversarial perturbation. Unfortunately, computing such a distance is highly complex due to the non-convex nature of NNs. Although several methods have been proposed to address this issue, to the best of our knowledge, no provable results have been presented to estimate and bound the error committed. This paper addresses this issue by proposing two lightweight strategies to find the minimal adversarial perturbation. Differently from the state of the art, the proposed approach allows formulating an error estimation theory of the approximate distance with respect to the theoretical one. Finally, a substantial set of experiments is reported to evaluate the performance of the algorithms and support the theoretical findings. The obtained results show that the proposed strategies approximate the theoretical distance for samples close to the classification boundary, leading to provable robustness guarantees against any adversarial attack.
    COVID-19 Disease Progression Prediction via Audio Signals: A Longitudinal Study. (arXiv:2201.01232v1 [cs.SD])
    (2 min) Recent work has shown the potential of using audio data in screening for COVID-19. However, very little exploration has been done of monitoring disease progression, especially recovery, in COVID-19 through audio. Tracking disease progression characteristics and patterns of recovery could lead to tremendous insights and more timely treatment or treatment adjustment, as well as better resource management in health care systems. The primary objective of this study is to explore the potential of longitudinal audio dynamics for COVID-19 monitoring using sequential deep learning techniques, focusing on prediction of disease progression and, especially, recovery trend prediction. We analysed crowdsourced respiratory audio data from 212 individuals over 5 to 385 days, alongside their self-reported COVID-19 test results. We first explore the benefits of capturing longitudinal dynamics of audio biomarkers for COVID-19 detection. The strong performance, yielding an AUC-ROC of 0.79, sensitivity of 0.75 and specificity of 0.70, supports the effectiveness of the approach compared to methods that do not leverage longitudinal dynamics. We further examine the predicted disease progression trajectory, which displays high consistency with the longitudinal test results, with a correlation of 0.76 in the test cohort and 0.86 in a subset of the test cohort with 12 participants who reported disease recovery. Our findings suggest that monitoring COVID-19 progression via longitudinal audio data has enormous potential in tracking individuals' disease progression and recovery.
    Fast Approximation of the Sliced-Wasserstein Distance Using Concentration of Random Projections. (arXiv:2106.15427v2 [stat.ML] UPDATED)
    (2 min) The Sliced-Wasserstein distance (SW) is increasingly used in machine learning applications as an alternative to the Wasserstein distance, offering significant computational and statistical benefits. Since it is defined as an expectation over random projections, SW is commonly approximated by Monte Carlo. We adopt a new perspective to approximate SW by making use of the concentration of measure phenomenon: under mild assumptions, one-dimensional projections of a high-dimensional random vector are approximately Gaussian. Based on this observation, we develop a simple deterministic approximation for SW. Our method does not require sampling a number of random projections, and is therefore both accurate and easy to use compared to the usual Monte Carlo approximation. We derive non-asymptotic guarantees for our approach, and show that the approximation error goes to zero as the dimension increases, under a weak dependence condition on the data distribution. We validate our theoretical findings on synthetic datasets, and illustrate the proposed approximation on a generative modeling problem.
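    For context, the Monte Carlo estimator that the deterministic approximation replaces looks roughly like this (a sketch assuming equal sample sizes; the paper's closed-form Gaussian-based approximation is not reproduced here):

        import numpy as np

        def sliced_wasserstein_mc(X, Y, n_proj=100, p=2, seed=None):
            """Monte Carlo estimate of the Sliced-Wasserstein distance
            between empirical distributions X, Y of shape (n, d)."""
            rng = np.random.default_rng(seed)
            theta = rng.normal(size=(n_proj, X.shape[1]))
            theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # unit directions
            sw = 0.0
            for t in theta:
                # 1-D Wasserstein-p between sorted projections
                sw += np.mean(np.abs(np.sort(X @ t) - np.sort(Y @ t)) ** p)
            return (sw / n_proj) ** (1.0 / p)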
    Adaptive Template Enhancement for Improved Person Recognition using Small Datasets. (arXiv:2201.01218v1 [eess.SP])
    (2 min) A novel instance-based method for the classification of electroencephalography (EEG) signals is presented and evaluated in this paper. The non-stationary nature of the EEG signals, coupled with the demanding task of pattern recognition with limited training data as well as the potentially noisy signal acquisition conditions, have motivated the work reported in this study. The proposed adaptive template enhancement mechanism transforms the feature-level instances by treating each feature dimension separately, hence resulting in improved class separation and better query-class matching. The proposed new instance-based learning algorithm is compared with a few related algorithms in a number of scenarios. A clinical grade 64-electrode EEG database, as well as a low-quality (high-noise level) EEG database obtained with a low-cost system using a single dry sensor have been used for evaluations in biometric person recognition. The proposed approach demonstrates significantly improved classification accuracy in both identification and verification scenarios. In particular, this new method is seen to provide a good classification performance for noisy EEG data, indicating its potential suitability for a wide range of applications.
    Survey on the Convergence of Machine Learning and Blockchain. (arXiv:2201.00976v1 [cs.LG])
    (2 min) Machine learning (ML) has been pervasively researched nowadays and it has been applied in many aspects of real life. Nevertheless, issues of model and data still accompany the development of ML. For instance, training of traditional ML models is limited by access to data sets, which are generally proprietary; published ML models may soon be out of date without updates from new data and continuous training; malicious data contributors may upload wrongly labeled data that leads to undesirable training results; and the abuse of private data and data leakage also exist. With the utilization of blockchain, an emerging and swiftly developing technology, these problems can be efficiently solved. In this paper, we conduct a survey of the convergence of collaborative ML and blockchain. We investigate different ways of combining these two technologies and their fields of application. We also discuss the limitations of current research and their future directions.
    FROTE: Feedback Rule-Driven Oversampling for Editing Models. (arXiv:2201.01070v1 [cs.LG])
    (2 min) Machine learning models may involve decision boundaries that change over time due to updates to rules and regulations, such as in loan approvals or claims management. However, in such scenarios, it may take time for sufficient training data to accumulate in order to retrain the model to reflect the new decision boundaries. While work has been done to reinforce existing decision boundaries, very little has been done to cover these scenarios where decision boundaries of the ML models should change in order to reflect new rules. In this paper, we focus on user-provided feedback rules as a way to expedite the ML models update process, and we formally introduce the problem of pre-processing training data to edit an ML model in response to feedback rules such that once the model is retrained on the pre-processed data, its decision boundaries align more closely with the rules. To solve this problem, we propose a novel data augmentation method, the Feedback Rule-Based Oversampling Technique. Extensive experiments using different ML models and real world datasets demonstrate the effectiveness of the method, in particular the benefit of augmentation and the ability to handle many feedback rules.
    DeepVisualInsight: Time-Travelling Visualization for Spatio-Temporal Causality of Deep Classification Training. (arXiv:2201.01155v1 [cs.LG])
    (2 min) Understanding how the predictions of deep learning models are formed during the training process is crucial to improve model performance and fix model defects, especially when we need to investigate nontrivial training strategies such as active learning, and track the root cause of unexpected training results such as performance degeneration. In this work, we propose a time-travelling visual solution DeepVisualInsight (DVI), aiming to manifest the spatio-temporal causality while training a deep learning image classifier. The spatio-temporal causality demonstrates how the gradient-descent algorithm and various training data sampling techniques can influence and reshape the layout of learnt input representation and the classification boundaries in consecutive epochs. Such causality allows us to observe and analyze the whole learning process in the visible low dimensional space. Technically, we propose four spatial and temporal properties and design our visualization solution to satisfy them. These properties preserve the most important information when (inverse-)projecting input samples between the visible low-dimensional and the invisible high-dimensional space, for causal analyses. Our extensive experiments show that, compared to baseline approaches, we achieve the best visualization performance regarding the spatial/temporal properties and visualization efficiency. Moreover, our case study shows that our visual solution can well reflect the characteristics of various training scenarios, showing good potential of DVI as a debugging tool for analyzing deep learning training processes.
    Discovering Diverse Nearly Optimal Policies with Successor Features. (arXiv:2106.00669v2 [cs.AI] UPDATED)
    (2 min) Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose Diverse Successive Policies, a method for discovering policies that are diverse in the space of Successor Features, while assuring that they are near optimal. We formalize the problem as a Constrained Markov Decision Process (CMDP) where the goal is to find policies that maximize diversity, characterized by an intrinsic diversity reward, while remaining near-optimal with respect to the extrinsic reward of the MDP. We also analyze how recently proposed robustness and discrimination rewards perform, and find that they are sensitive to the initialization of the procedure and may converge to sub-optimal solutions. To alleviate this, we propose new explicit diversity rewards that aim to minimize the correlation between the Successor Features of the policies in the set. We compare the different diversity mechanisms in the DeepMind Control Suite and find that the type of explicit diversity we propose is important for discovering distinct behavior, such as different locomotion patterns.
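    One way to instantiate "minimize the correlation between Successor Features" (a hedged sketch, not necessarily the authors' exact diversity reward):

        import torch
        import torch.nn.functional as F

        def sf_diversity_penalty(psis):
            """psis: (k, d) tensor, one successor-feature vector per policy.
            Penalizes squared cosine similarity between every distinct pair,
            so minimizing it pushes the policies' SFs toward orthogonality."""
            z = F.normalize(psis, dim=1)
            g = z @ z.t() - torch.eye(psis.shape[0])  # zero out self-similarity
            k = psis.shape[0]
            return (g ** 2).sum() / (k * (k - 1))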
    Biased Hypothesis Formation From Projection Pursuit. (arXiv:2201.00889v1 [cs.LG])
    (2 min) The effect of bias on hypothesis formation is characterized for an automated data-driven projection pursuit neural network to extract and select features for binary classification of data streams. This intelligent exploratory process partitions a complete vector state space into disjoint subspaces to create working hypotheses quantified by similarities and differences observed between two groups of labeled data streams. Data streams are typically time sequenced, and may exhibit complex spatio-temporal patterns. For example, given atomic trajectories from molecular dynamics simulation, the machine's task is to quantify dynamical mechanisms that promote function by comparing protein mutants, some known to function while others are nonfunctional. Utilizing synthetic two-dimensional molecules that mimic the dynamics of functional and nonfunctional proteins, biases are identified and controlled in both the machine learning model and selected training data under different contexts. The refinement of a working hypothesis converges to a statistically robust multivariate perception of the data based on a context-dependent perspective. Including diverse perspectives during data exploration enhances interpretability of the multivariate characterization of similarities and differences.
    Machine Learning-Based COVID-19 Patients Triage Algorithm using Patient-Generated Health Data from Nationwide Multicenter Database. (arXiv:2109.09001v3 [cs.LG] UPDATED)
    (2 min) A model that can immediately measure the severity of an epidemic would be of great help to the medical system. In this paper, we use machine learning to create and analyze a model that can immediately measure the severity of SARS-CoV-2 patients. Since our model uses a nationwide dataset consisting of basic personal information of patients, it is of great significance that patients can check their own severity. Moreover, based on the predicted severity score, we use our model to direct patients to the appropriate clinical center.
    Homological Time Series Analysis of Sensor Signals from Power Plants. (arXiv:2106.02493v4 [cs.LG] UPDATED)
    (2 min) In this paper, we use topological data analysis techniques to construct a suitable neural network classifier for the task of learning sensor signals of entire power plants according to their reference designation system. We use representations of persistence diagrams to derive necessary preprocessing steps and visualize the large amounts of data. We derive deep architectures with one-dimensional convolutional layers combined with stacked long short-term memories as residual networks suitable for processing the persistence features. We combine three separate sub-networks, obtaining as input the time series itself and a representation of the persistent homology for the zeroth and first dimension. We give a mathematical derivation for most of the used hyper-parameters. For validation, numerical experiments were performed with sensor data from four power plants of the same construction type.
    CEMENT: Incomplete Multi-View Weak-Label Learning with Long Tail Labels. (arXiv:2201.01079v1 [cs.LG])
    (2 min) A variety of modern applications exhibit multi-view multi-label learning, where each sample has multi-view features, and multiple labels are correlated via common views. In recent years, several methods have been proposed to cope with this setting and have achieved much success, but they still suffer from two key problems: 1) they lack the ability to deal with incomplete multi-view weak-label data, in which only a subset of features and labels are provided for each sample; 2) they ignore the presence of noisy views and tail labels, which commonly occur in real-world problems. In this paper, we propose a novel method, named CEMENT, to overcome these limitations. For 1), CEMENT jointly embeds incomplete views and weak labels into distinct low-dimensional subspaces, and then correlates them via the Hilbert-Schmidt Independence Criterion (HSIC). For 2), CEMENT adaptively learns the weights of embeddings to capture noisy views, and explores an additional sparse component to model tail labels, making low-rankness available in the multi-label setting. We develop an alternating algorithm to solve the proposed optimization problem. Experimental results on seven real-world datasets demonstrate the effectiveness of the proposed method.
    Classifying Autism from Crowdsourced Semi-Structured Speech Recordings: A Machine Learning Approach. (arXiv:2201.00927v1 [cs.SD])
    (2 min) Autism spectrum disorder (ASD) is a neurodevelopmental disorder which results in altered behavior, social development, and communication patterns. In past years, autism prevalence has tripled, with 1 in 54 children now affected. Given that traditional diagnosis is a lengthy, labor-intensive process, significant attention has been given to developing systems that automatically screen for autism. Prosody abnormalities are among the clearest signs of autism, with affected children displaying speech idiosyncrasies including echolalia, monotonous intonation, atypical pitch, and irregular linguistic stress patterns. In this work, we present a suite of machine learning approaches to detect autism in self-recorded speech audio captured from autistic and neurotypical (NT) children in home environments. We consider three methods to detect autism in child speech: first, Random Forests trained on extracted audio features (including Mel-frequency cepstral coefficients); second, convolutional neural networks (CNNs) trained on spectrograms; and third, fine-tuned wav2vec 2.0--a state-of-the-art Transformer-based ASR model. We train our classifiers on our novel dataset of cellphone-recorded child speech audio curated from Stanford's Guess What? mobile game, an app designed to crowdsource videos of autistic and neurotypical children in a natural home environment. The Random Forest classifier achieves 70% accuracy, the fine-tuned wav2vec 2.0 model achieves 77% accuracy, and the CNN achieves 79% accuracy when classifying children's audio as either ASD or NT. Our models were able to predict autism status when training on a varied selection of home audio clips with inconsistent recording quality, which may be more generalizable to real world conditions. These results demonstrate that machine learning methods offer promise in detecting autism automatically from speech without specialized equipment.
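    The first of the three methods is straightforward to sketch (file paths and labels below are placeholders, not the actual dataset):

        import numpy as np
        import librosa
        from sklearn.ensemble import RandomForestClassifier

        def mfcc_features(path, sr=16000, n_mfcc=13):
            # Mean-pooling the MFCC matrix yields one fixed-length vector per clip.
            y, sr = librosa.load(path, sr=sr)
            return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

        X = np.stack([mfcc_features(p) for p in train_paths])  # hypothetical paths
        clf = RandomForestClassifier(n_estimators=300, random_state=0)
        clf.fit(X, train_labels)  # hypothetical ASD/NT labels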
    Towards Fair Recommendation in Two-Sided Platforms. (arXiv:2201.01180v1 [cs.IR])
    (2 min) Many online platforms today (such as Amazon, Netflix, Spotify, LinkedIn, and AirBnB) can be thought of as two-sided markets with producers and customers of goods and services. Traditionally, recommendation services in these platforms have focused on maximizing customer satisfaction by tailoring the results according to the personalized preferences of individual customers. However, our investigation reinforces the fact that such customer-centric design of these services may lead to unfair distribution of exposure to the producers, which may adversely impact their well-being. On the other hand, a pure producer-centric design might become unfair to the customers. As more and more people are depending on such platforms to earn a living, it is important to ensure fairness to both producers and customers. In this work, by mapping a fair personalized recommendation problem to a constrained version of the problem of fairly allocating indivisible goods, we propose to provide fairness guarantees for both sides. Formally, our proposed FairRec algorithm guarantees Maxi-Min Share (α-MMS) of exposure for the producers, and Envy-Free up to One Item (EF1) fairness for the customers. Extensive evaluations over multiple real-world datasets show the effectiveness of FairRec in ensuring two-sided fairness while incurring a marginal loss in overall recommendation quality. Finally, we present a modification of FairRec (named FairRecPlus) that, at the cost of additional computation time, improves the recommendation performance for the customers while maintaining the same fairness guarantees.
    Nipping in the Bud: Detection, Diffusion and Mitigation of Hate Speech on Social Media. (arXiv:2201.00961v1 [cs.SI])
    (2 min) Since the proliferation of social media usage, hate speech has become a major crisis. Hateful content can spread quickly and create an environment of distress and hostility. Further, what can be considered hateful is contextual and varies with time. While online hate speech reduces the ability of already marginalised groups to participate in discussion freely, offline hate speech leads to hate crimes and violence against individuals and communities. The multifaceted nature of hate speech and its real-world impact have already piqued the interest of the data mining and machine learning communities. Despite our best efforts, hate speech remains an evasive issue for researchers and practitioners alike. This article presents methodological challenges that hinder building automated hate mitigation systems. These challenges inspired our work in the broader area of combating hateful content on the web. We discuss a series of our proposed solutions to limit the spread of hate speech on social media.
    CHERRY: a Computational metHod for accuratE pRediction of virus-pRokarYotic interactions using a graph encoder-decoder model. (arXiv:2201.01018v1 [q-bio.GN])
    (2 min) Prokaryotic viruses, which infect bacteria and archaea, are key players in microbial communities. Predicting the hosts of prokaryotic viruses helps decipher the dynamic relationship between microbes. Although there are experimental methods for host identification, they are either labor-intensive or require the cultivation of the host cells, creating a need for computational host prediction. Despite some promising results, computational host prediction remains a challenge because of the limited known interactions and the sheer number of phages sequenced by high-throughput technologies. The state-of-the-art methods can only achieve 43% accuracy at the species level. This work presents CHERRY, a tool formulating host prediction as link prediction in a knowledge graph. As a virus-prokaryotic interaction prediction tool, CHERRY can be applied to predict hosts for newly discovered viruses and also the viruses infecting antibiotic-resistant bacteria. We demonstrated the utility of CHERRY for both applications and compared its performance with the state-of-the-art methods in different scenarios. To the best of our knowledge, CHERRY has the highest accuracy in identifying virus-prokaryote interactions. It outperforms all the existing methods at the species level with an accuracy increase of 37%. In addition, CHERRY's performance is more stable on short contigs than other tools.
    Extending CLIP for Category-to-image Retrieval in E-commerce. (arXiv:2112.11294v2 [cs.IR] UPDATED)
    (2 min) E-commerce provides rich multimodal data that is barely leveraged in practice. One aspect of this data is a category tree that is used in search and recommendation. However, in practice, during a user's session there is often a mismatch between a textual and a visual representation of a given category. Motivated by the problem, we introduce the task of category-to-image retrieval in e-commerce and propose a model for the task, CLIP-ITA. The model leverages information from multiple modalities (textual, visual, and attribute) to create product representations, and we explore how adding information from each modality impacts the model's performance. In particular, we observe that CLIP-ITA significantly outperforms a comparable model that leverages only the visual modality and a comparable model that leverages the visual and attribute modalities.
    Aligning Domain-specific Distribution and Classifier for Cross-domain Classification from Multiple Sources. (arXiv:2201.01003v1 [cs.LG])
    (2 min) While Unsupervised Domain Adaptation (UDA) algorithms, in which labeled data are available only from source domains, have been actively studied in recent years, most algorithms and theoretical results focus on Single-source Unsupervised Domain Adaptation (SUDA). However, in practical scenarios, labeled data can typically be collected from multiple diverse sources, and these sources might differ not only from the target domain but also from each other. Thus, domain adapters from multiple sources should not be modeled in the same way. Recent deep learning based Multi-source Unsupervised Domain Adaptation (MUDA) algorithms focus on extracting common domain-invariant representations for all domains by aligning the distributions of all pairs of source and target domains in a common feature space. However, it is often very hard to extract the same domain-invariant representations for all domains in MUDA. In addition, these methods match distributions without considering domain-specific decision boundaries between classes. To solve these problems, we propose a new framework with two alignment stages for MUDA which not only respectively aligns the distributions of each pair of source and target domains in multiple specific feature spaces, but also aligns the outputs of classifiers by utilizing the domain-specific decision boundaries. Extensive experiments demonstrate that our method can achieve remarkable results on popular benchmark datasets for image classification.
    A Heterogeneous In-Memory Computing Cluster For Flexible End-to-End Inference of Real-World Deep Neural Networks. (arXiv:2201.01089v1 [cs.AR])
    (2 min) Deployment of modern TinyML tasks on small battery-constrained IoT devices requires high computational energy efficiency. Analog In-Memory Computing (IMC) using non-volatile memory (NVM) promises major efficiency improvements in deep neural network (DNN) inference and serves as on-chip memory storage for DNN weights. However, IMC's functional flexibility limitations and their impact on performance, energy, and area efficiency are not yet fully understood at the system level. To target practical end-to-end IoT applications, IMC arrays must be enclosed in heterogeneous programmable systems, introducing new system-level challenges which we aim at addressing in this work. We present a heterogeneous tightly-coupled clustered architecture integrating 8 RISC-V cores, an in-memory computing accelerator (IMA), and digital accelerators. We benchmark the system on a highly heterogeneous workload such as the Bottleneck layer from a MobileNetV2, showing 11.5x performance and 9.5x energy efficiency improvements, compared to highly optimized parallel execution on the cores. Furthermore, we explore the requirements for end-to-end inference of a full mobile-grade DNN (MobileNetV2) in terms of IMC array resources, by scaling up our heterogeneous architecture to a multi-array accelerator. Our results show that our solution, on the end-to-end inference of the MobileNetV2, is one order of magnitude better in terms of execution latency than existing programmable architectures and two orders of magnitude better than state-of-the-art heterogeneous solutions integrating in-memory computing analog cores.
    Graph Neural Networks With Lifting-based Adaptive Graph Wavelets. (arXiv:2108.01660v3 [cs.LG] UPDATED)
    (2 min) Spectral-based graph neural networks (SGNNs) have been attracting increasing attention in graph representation learning. However, existing SGNNs are limited in implementing graph filters with rigid transforms (e.g., graph Fourier or predefined graph wavelet transforms) and cannot adapt to signals residing on graphs and tasks at hand. In this paper, we propose a novel class of graph neural networks that realizes graph filters with adaptive graph wavelets. Specifically, the adaptive graph wavelets are learned with neural network-parameterized lifting structures, where structure-aware attention-based lifting operations (i.e., prediction and update operations) are developed to jointly consider graph structures and node features. We propose to lift based on diffusion wavelets to alleviate the structural information loss induced by partitioning non-bipartite graphs. By design, the locality and sparsity of the resulting wavelet transform as well as the scalability of the lifting structure are guaranteed. We further derive a soft-thresholding filtering operation by learning sparse graph representations in terms of the learned wavelets, yielding localized, efficient, and scalable wavelet-based graph filters. To ensure that the learned graph representations are invariant to node permutations, a layer is employed at the input of the networks to reorder the nodes according to their local topology information. We evaluate the proposed networks in both node-level and graph-level representation learning tasks on benchmark citation and bioinformatics graph datasets. Extensive experiments demonstrate the superiority of the proposed networks over existing SGNNs in terms of accuracy, efficiency, and scalability.
    Submix: Practical Private Prediction for Large-Scale Language Models. (arXiv:2201.00971v1 [cs.LG])
    (2 min) Recent data-extraction attacks have exposed that language models can memorize some training samples verbatim. This is a vulnerability that can compromise the privacy of the model's training data. In this work, we introduce SubMix: a practical protocol for private next-token prediction designed to prevent privacy violations by language models that were fine-tuned on a private corpus after pre-training on a public corpus. We show that SubMix limits the leakage of information that is unique to any individual user in the private corpus via a relaxation of group differentially private prediction. Importantly, SubMix admits a tight, data-dependent privacy accounting mechanism, which allows it to thwart existing data-extraction attacks while maintaining the utility of the language model. SubMix is the first protocol that maintains privacy even when publicly releasing tens of thousands of next-token predictions made by large transformer-based models such as GPT-2.
    Rxn Hypergraph: a Hypergraph Attention Model for Chemical Reaction Representation. (arXiv:2201.01196v1 [cs.LG])
    (0 min) It is fundamental for science and technology to be able to predict chemical reactions and their properties. To achieve such skills, it is important to develop good representations of chemical reactions, or good deep learning architectures that can learn such representations automatically from the data. There is currently no universal and widely adopted method for robustly representing chemical reactions. Most existing methods suffer from one or more drawbacks, such as: (1) lacking universality; (2) lacking robustness; (3) lacking interpretability; or (4) requiring excessive manual pre-processing. Here we exploit graph-based representations of molecular structures to develop and test a hypergraph attention neural network approach to solve at once the reaction representation and property-prediction problems, alleviating the aforementioned drawbacks. We evaluate this hypergraph representation in three experiments using three independent data sets of chemical reactions. In all experiments, the hypergraph-based approach matches or outperforms other representations and their corresponding models of chemical reactions while yielding interpretable multi-level representations.
    Finding General Equilibria in Many-Agent Economic Simulations Using Deep Reinforcement Learning. (arXiv:2201.01163v1 [cs.GT])
    (0 min) Real economies can be seen as a sequential imperfect-information game with many heterogeneous, interacting strategic agents of various agent types, such as consumers, firms, and governments. Dynamic general equilibrium models are common economic tools to model the economic activity, interactions, and outcomes in such systems. However, existing analytical and computational methods struggle to find explicit equilibria when all agents are strategic and interact, while joint learning is unstable and challenging. Amongst others, a key reason is that the actions of one economic agent may change the reward function of another agent, e.g., a consumer's expendable income changes when firms change prices or governments change taxes. We show that multi-agent deep reinforcement learning (RL) can discover stable solutions that are epsilon-Nash equilibria for a meta-game over agent types, in economic simulations with many agents, through the use of structured learning curricula and efficient GPU-only simulation and training. Conceptually, our approach is more flexible and does not need unrealistic assumptions, e.g., market clearing, that are commonly used for analytical tractability. Our GPU implementation enables training and analyzing economies with a large number of agents within reasonable time frames, e.g., training completes within a day. We demonstrate our approach in real-business-cycle models, a representative family of DGE models, with 100 worker-consumers, 10 firms, and a government that taxes and redistributes. We validate the learned meta-game epsilon-Nash equilibria through approximate best-response analyses, show that RL policies align with economic intuitions, and that our approach is constructive, e.g., by explicitly learning a spectrum of meta-game epsilon-Nash equilibria in open RBC models.
    Positional Encoding Augmented GAN for the Assessment of Wind Flow for Pedestrian Comfort in Urban Areas. (arXiv:2112.08447v2 [cs.CV] UPDATED)
    (0 min) Approximating wind flows using computational fluid dynamics (CFD) methods can be time-consuming. Creating a tool for interactively designing prototypes while observing the wind flow change requires simpler models to simulate faster. Instead of running numerical approximations resulting in detailed calculations, data-driven methods and deep learning might be able to give similar results in a fraction of the time. This work rephrases the problem from computing 3D flow fields using CFD to a 2D image-to-image translation-based problem on the building footprints to predict the flow field at pedestrian height level. We investigate the use of generative adversarial networks (GAN), such as Pix2Pix [1] and CycleGAN [2] representing state-of-the-art for image-to-image translation tasks in various domains, as well as the U-Net autoencoder [3]. The models can learn the underlying distribution of a dataset in a data-driven manner, which we argue can help the model learn the underlying Reynolds-averaged Navier-Stokes (RANS) equations from CFD. We experiment on novel simulated datasets on various three-dimensional bluff-shaped buildings with and without height information. Moreover, we present an extensive qualitative and quantitative evaluation of the generated images for a selection of models and compare their performance with the simulations delivered by CFD. We then show that adding positional data to the input can produce more accurate results by proposing a general framework for injecting such information on the different architectures. Furthermore, we show that the models' performance improves when attention mechanisms and spectral normalization are applied to facilitate stable training.
    Quantifying Uncertainty in Deep Learning Approaches to Radio Galaxy Classification. (arXiv:2201.01203v1 [astro-ph.CO])
    (0 min) In this work we use variational inference to quantify the degree of uncertainty in deep learning model predictions of radio galaxy classification. We show that the level of model posterior variance for individual test samples is correlated with human uncertainty when labelling radio galaxies. We explore the model performance and uncertainty calibration for a variety of different weight priors and suggest that a sparse prior produces more well-calibrated uncertainty estimates. Using the posterior distributions for individual weights, we show that we can prune 30% of the fully-connected layer weights without significant loss of performance by removing the weights with the lowest signal-to-noise ratio (SNR). We demonstrate that a larger degree of pruning can be achieved using a Fisher information based ranking, but we note that both pruning methods affect the uncertainty calibration for Fanaroff-Riley type I and type II radio galaxies differently. Finally we show that, like other work in this field, we experience a cold posterior effect, whereby the posterior must be down-weighted to achieve good predictive performance. We examine whether adapting the cost function to accommodate model misspecification can compensate for this effect, but find that it does not make a significant difference. We also examine the effect of principled data augmentation and find that this improves upon the baseline but also does not compensate for the observed effect. We interpret this as the cold posterior effect being due to the overly effective curation of our training sample leading to likelihood misspecification, and raise this as a potential issue for Bayesian deep learning approaches to radio galaxy classification in future.
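    The SNR pruning step is easy to sketch given a factorized Gaussian posterior over the weights (a hedged sketch; the tensor names are assumptions):

        import torch

        def prune_by_snr(mu, sigma, frac=0.3):
            """Zero out the fraction `frac` of weights with the lowest
            posterior signal-to-noise ratio |mu| / sigma.
            mu, sigma: posterior mean and std tensors for one layer."""
            snr = mu.abs() / sigma
            k = max(1, int(frac * snr.numel()))
            threshold = snr.flatten().kthvalue(k).values
            mask = (snr > threshold).to(mu.dtype)
            return mu * mask, mask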
    Delving into Sample Loss Curve to Embrace Noisy and Imbalanced Data. (arXiv:2201.00849v1 [cs.LG])
    (2 min) Corrupted labels and class imbalance are commonly encountered in practically collected training data, and both easily lead to over-fitting of deep neural networks (DNNs). Existing approaches alleviate these issues by adopting a sample re-weighting strategy, which re-weights samples via a designed weighting function. However, such strategies are only applicable to training data containing one type of bias. In practice, however, biased samples with corrupted labels and from tail classes commonly co-exist in training data. How to handle them simultaneously is a key but under-explored problem. In this paper, we find that these two types of biased samples, though they have similar transient losses, show distinguishable trends and characteristics in their loss curves, which can provide valuable priors for sample weight assignment. Motivated by this, we delve into the loss curves and propose a novel probe-and-allocate training strategy: in the probing stage, we train the network on the whole biased training data without intervention and record the loss curve of each sample as an additional attribute; in the allocating stage, we feed the resulting attribute to a newly designed curve-perception network, named CurveNet, which learns to identify the bias type of each sample and assign proper weights adaptively through meta-learning. The slow training speed of meta-learning also blocks its application; to address this, we propose a method named skip-layer meta optimization (SLMO) that accelerates training by skipping the bottom layers. Extensive synthetic and real experiments validate the proposed method, which achieves state-of-the-art performance on multiple challenging benchmarks.
    Interactive Attention AI to translate low light photos to captions for night scene understanding in women safety. (arXiv:2201.00969v1 [cs.CV])
    (0 min) There is amazing progress in Deep Learning based models for image captioning and low-light image enhancement. For the first time in the literature, this paper develops a Deep Learning model that translates night scenes to sentences, opening new possibilities for AI applications for the safety of visually impaired women. Inspired by Image Captioning and Visual Question Answering, a novel Interactive Image Captioning approach is developed. A user can make the AI focus on any chosen person of interest by influencing the attention scoring. Attention context vectors are computed from CNN feature vectors and a user-provided start word. The Encoder-Attention-Decoder neural network learns to produce captions from low-brightness images. This paper demonstrates how women's safety can be enabled by researching a novel AI capability in an Interactive Vision-Language model for perception of the environment at night.
    Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions. (arXiv:2107.13782v3 [cs.LG] UPDATED)
    (0 min) Multimodal deep learning systems that employ multiple modalities like text, image, audio, video, etc. are showing better performance than individual-modality (i.e., unimodal) systems. Multimodal machine learning involves multiple aspects: representation, translation, alignment, fusion, and co-learning. In the current state of multimodal machine learning, the assumption is that all modalities are present, aligned, and noiseless during training and testing time. However, in real-world tasks it is typically observed that one or more modalities may be missing, noisy, lacking annotated data, unreliably labeled, and scarce during training and/or testing. This challenge is addressed by a learning paradigm called multimodal co-learning. The modeling of a (resource-poor) modality is aided by exploiting knowledge from another (resource-rich) modality using transfer of knowledge between modalities, including their representations and predictive models. Co-learning being an emerging area, there are no dedicated reviews explicitly focusing on all the challenges it addresses. To that end, in this work we provide a comprehensive survey on the emerging area of multimodal co-learning, which has not yet been explored in its entirety. We review implementations that overcome one or more co-learning challenges without explicitly considering them as co-learning challenges. We present a comprehensive taxonomy of multimodal co-learning based on the challenges addressed by co-learning and the associated implementations. The various techniques employed, including the latest ones, are reviewed along with some of the applications and datasets. Our final goal is to discuss challenges and perspectives, along with important ideas and directions for future work, that we hope will be beneficial for the entire research community focusing on this exciting domain.
    Super-resolution in Molecular Dynamics Trajectory Reconstruction with Bi-Directional Neural Networks. (arXiv:2201.01195v1 [physics.comp-ph])
    (0 min) Molecular dynamics simulations are a cornerstone in science, allowing investigations from a system's thermodynamics to intricate molecular interactions. In general, creating extended molecular trajectories can be a computationally expensive process, for example when running ab initio simulations. Hence, repeating such calculations to either obtain more accurate thermodynamics or to get a higher resolution in the dynamics generated by a fine-grained quantum interaction can be costly in both time and computation. In this work, we explore different machine learning (ML) methodologies to increase the resolution of molecular dynamics trajectories on demand within a post-processing step. As a proof of concept, we analyse the performance of bi-directional neural networks such as neural ODEs, Hamiltonian networks, recurrent neural networks and LSTMs, as well as the uni-directional variants as a reference, for molecular dynamics simulations (here: the MD17 dataset). We have found that Bi-LSTMs are the best performing models; by utilizing the local time-symmetry of thermostated trajectories they can even learn long-range correlations and display high robustness to noisy dynamics across molecular complexity. Our models can reach accuracies of up to 10^-4 angstroms in trajectory interpolation, while faithfully reconstructing several full cycles of unseen intricate high-frequency molecular vibrations, rendering the comparison between the learned and reference trajectories indistinguishable. The results reported in this work can serve (1) as a baseline for larger systems, as well as (2) for the construction of better MD integrators.
    Runway Extraction and Improved Mapping from Space Imagery. (arXiv:2201.00848v1 [cs.CV])
    (0 min) Change detection methods applied to monitoring key infrastructure like airport runways represent an important capability for disaster relief and urban planning. The present work identifies two generative adversarial networks (GAN) architectures that translate reversibly between plausible runway maps and satellite imagery. We illustrate the training capability using paired images (satellite-map) from the same point of view and using the Pix2Pix architecture or conditional GANs. In the absence of available pairs, we likewise show that CycleGAN architectures with four network heads (discriminator-generator pairs) can also provide effective style transfer from raw image pixels to outline or feature maps. To emphasize the runway and tarmac boundaries, we experimentally show that the traditional grey-tan map palette is not a required training input but can be augmented by higher contrast mapping palettes (red-black) for sharper runway boundaries. We preview a potentially novel use case (called "sketch2satellite") where a human roughly draws the current runway boundaries and automates the machine output of plausible satellite images. Finally, we identify examples of faulty runway maps where the published satellite and mapped runways disagree but an automated update renders the correct map using GANs.
    Multi-Representation Adaptation Network for Cross-domain Image Classification. (arXiv:2201.01002v1 [cs.CV])
    (0 min) In image classification, it is often expensive and time-consuming to acquire sufficient labels. To solve this problem, domain adaptation often provides an attractive option given a large amount of labeled data from a similar but different domain. Existing approaches mainly align the distributions of representations extracted by a single structure, and the representations may only contain partial information, e.g., only part of the saturation, brightness, and hue information. Along this line, we propose Multi-Representation Adaptation, which can dramatically improve the classification accuracy for cross-domain image classification and specifically aims to align the distributions of multiple representations extracted by a hybrid structure named Inception Adaptation Module (IAM). Based on this, we present Multi-Representation Adaptation Network (MRAN) to accomplish the cross-domain image classification task via multi-representation alignment, which can capture information from different aspects. In addition, we extend Maximum Mean Discrepancy (MMD) to compute the adaptation loss. Our approach can be easily implemented by extending most feed-forward models with IAM, and the network can be trained efficiently via back-propagation. Experiments conducted on three benchmark image datasets demonstrate the effectiveness of MRAN. The code is available at https://github.com/easezyc/deep-transfer-learning.
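    The adaptation loss builds on MMD; a standard single-kernel version looks like this (a sketch: the paper's multi-representation extension aggregates such terms over several feature extractors, typically with a multi-kernel variant):

        import torch

        def mmd_rbf(x, y, sigma=1.0):
            """Squared MMD with an RBF kernel between source features x (n, d)
            and target features y (m, d). Biased estimate (keeps the diagonal),
            which is acceptable for use as a loss term."""
            def k(a, b):
                return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
            return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()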
    Self-supervised representation learning from 12-lead ECG data. (arXiv:2103.12676v2 [eess.SP] UPDATED)
    (0 min) Clinical 12-lead electrocardiography (ECG) is one of the most widely encountered kinds of biosignals. Despite the increased availability of public ECG datasets, label scarcity remains a central challenge in the field. Self-supervised learning represents a promising way to alleviate this issue. In this work, we put forward the first comprehensive assessment of self-supervised representation learning from clinical 12-lead ECG data. To this end, we adapt state-of-the-art self-supervised methods based on instance discrimination and latent forecasting to the ECG domain. In a first step, we learn contrastive representations and evaluate their quality based on linear evaluation performance on a recently established, comprehensive, clinical ECG classification task. In a second step, we analyze the impact of self-supervised pretraining on finetuned ECG classifiers as compared to purely supervised performance. For the best-performing method, an adaptation of contrastive predictive coding, we find a linear evaluation performance only 0.5% below supervised performance. For the finetuned models, we find improvements of roughly 1% in downstream performance compared to purely supervised training, along with gains in label efficiency and robustness against physiological noise. This work clearly establishes the feasibility of extracting discriminative representations from ECG data via self-supervised learning and the numerous advantages of finetuning such representations on downstream tasks as compared to purely supervised training. As the first comprehensive assessment of its kind in the ECG domain carried out exclusively on publicly available datasets, we hope to establish a first step towards reproducible progress in the rapidly evolving field of representation learning for biosignals.
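    The instance-discrimination objective used in such pipelines is typically an NT-Xent / InfoNCE loss over two augmented views of the same record (a generic sketch, not the paper's exact adaptation):

        import torch
        import torch.nn.functional as F

        def nt_xent(z1, z2, tau=0.1):
            """z1, z2: (n, d) embeddings of two views of the same ECG segments;
            row i of z1 and row i of z2 are positives."""
            z = F.normalize(torch.cat([z1, z2]), dim=1)    # (2n, d)
            sim = z @ z.t() / tau                          # scaled cosine similarities
            sim.fill_diagonal_(float('-inf'))              # exclude self-pairs
            n = z1.shape[0]
            targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
            return F.cross_entropy(sim, targets)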
    Transfer Learning for Retinal Vascular Disease Detection: A Pilot Study with Diabetic Retinopathy and Retinopathy of Prematurity. (arXiv:2201.01250v1 [cs.LG])
    (0 min) Retinal vascular diseases affect the well-being of the human body and sometimes provide vital signs of otherwise undetected bodily damage. Recently, deep learning techniques have been successfully applied to the detection of diabetic retinopathy (DR). The main obstacle to applying deep learning techniques to most other retinal vascular diseases is the limited amount of data available. In this paper, we propose a transfer learning technique that aims to exploit feature similarities for detecting retinal vascular diseases. We choose the well-studied DR detection as the source task and identify early detection of retinopathy of prematurity (ROP) as the target task. Our experimental results demonstrate that our DR-pretrained approach dominates the conventional ImageNet-pretrained transfer learning approach, currently adopted in medical image analysis, on all metrics. Moreover, our approach is more robust with respect to the stochasticity of the training process and to reduced training samples. This study suggests the potential of our proposed transfer learning approach for a broad range of retinal vascular diseases or pathologies where data is limited.
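    The transfer recipe described here, pretraining a backbone on DR and then swapping the classification head for ROP, follows a standard fine-tuning pattern. A hedged sketch using a torchvision ResNet (the abstract does not name the actual architecture or weights, so everything below is a stand-in) looks like:

    ```python
    import torch
    import torch.nn as nn
    from torchvision import models

    # Hypothetical DR-pretrained backbone: a ResNet-18 with a 5-grade DR head
    # stands in for the paper's unspecified source model.
    model = models.resnet18(weights=None)
    model.fc = nn.Linear(model.fc.in_features, 5)
    # ... assume the model has been trained on a DR dataset at this point ...

    # Transfer: keep the DR-trained feature extractor, attach a fresh ROP head.
    model.fc = nn.Linear(model.fc.in_features, 2)  # binary early-ROP detection

    # Optionally freeze the earliest layers so only higher-level features adapt.
    for name, p in model.named_parameters():
        if name.startswith(("conv1", "layer1")):
            p.requires_grad = False

    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4)
    ```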
    Marginal likelihood computation for model selection and hypothesis testing: an extensive review. (arXiv:2005.08334v4 [stat.CO] UPDATED)
    (0 min) This is an up-to-date introduction to, and overview of, marginal likelihood computation for model selection and hypothesis testing. Computing normalizing constants of probability models (or ratio of constants) is a fundamental issue in many applications in statistics, applied mathematics, signal processing and machine learning. This article provides a comprehensive study of the state-of-the-art of the topic. We highlight limitations, benefits, connections and differences among the different techniques. Problems and possible solutions with the use of improper priors are also described. Some of the most relevant methodologies are compared through theoretical comparisons and numerical experiments.
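    As a toy illustration of the normalizing-constant problem the review surveys, the following sketch (our own example, not drawn from the paper) estimates the marginal likelihood of a one-observation conjugate Gaussian model by naive Monte Carlo over the prior and checks it against the closed form:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy model: y ~ N(theta, 1) with prior theta ~ N(0, 1); one observation.
    y = 0.8

    def likelihood(theta):
        return np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2 * np.pi)

    # Naive Monte Carlo: average the likelihood over draws from the prior.
    theta = rng.normal(0.0, 1.0, size=100_000)
    Z_mc = likelihood(theta).mean()

    # Closed form for this conjugate model: marginally, y ~ N(0, 2).
    Z_exact = np.exp(-0.25 * y ** 2) / np.sqrt(4 * np.pi)
    print(Z_mc, Z_exact)  # the two estimates should agree closely
    ```

    The review covers far more sophisticated estimators (importance sampling, bridge sampling, nested sampling, etc.); this naive prior-sampling estimate is simply the baseline they improve upon.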
    A Deeper Understanding of State-Based Critics in Multi-Agent Reinforcement Learning. (arXiv:2201.01221v1 [cs.LG])
    (0 min) Centralized Training for Decentralized Execution, where training is done in a centralized offline fashion, has become a popular solution paradigm in Multi-Agent Reinforcement Learning. Many such methods take the form of actor-critic with state-based critics, since centralized training allows access to the true system state, which can be useful during training despite not being available at execution time. State-based critics have become a common empirical choice, albeit one which has had limited theoretical justification or analysis. In this paper, we show that state-based critics can introduce bias in the policy gradient estimates, potentially undermining the asymptotic guarantees of the algorithm. We also show that, even if state-based critics do not introduce any bias, they can still result in larger gradient variance, contrary to common intuition. Finally, we show the effects of these theories in practice by comparing different forms of centralized critics on a wide range of common benchmarks, and detail how various environmental properties relate to the effectiveness of different types of critics.
    Kendall transformation: a robust representation of continuous data for information theory. (arXiv:2006.15991v2 [cs.LG] UPDATED)
    (0 min) The Kendall transformation converts an ordered feature into a vector of pairwise order relations between individual values. In this way, it preserves the ranking of observations and represents it in a categorical form. Such a transformation allows for the generalisation of methods requiring strictly categorical input, especially in the limit of a small number of observations, when discretisation becomes problematic. In particular, many approaches from information theory can be directly applied to Kendall-transformed continuous data without relying on differential entropy or any additional parameters. Moreover, by filtering the information down to that contained in the ranking, the Kendall transformation leads to better robustness at the reasonable cost of dropping sophisticated interactions that are unlikely to be correctly estimated anyway. In bivariate analysis, the Kendall transformation can be related to popular non-parametric methods, showing the soundness of the approach. The paper also demonstrates its efficiency in multivariate problems, and provides an example analysis of real-world data.
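    The transformation itself is simple enough to sketch directly; the following function is an illustrative reading of the definition above, with our own naming, mapping a numeric vector to the categorical vector of its pairwise order relations:

    ```python
    import numpy as np

    def kendall_transform(x):
        """Map a vector to its pairwise order relations over ordered pairs (i, j), i != j."""
        x = np.asarray(x)
        n = len(x)
        pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
        # Categorical outcome per ordered pair: ">", "<", or "=".
        return np.array(
            [">" if x[i] > x[j] else "<" if x[i] < x[j] else "=" for i, j in pairs]
        )

    print(kendall_transform([3.1, 1.2, 2.0]))
    # ['>' '>' '<' '<' '<' '>']
    ```

    The output is strictly categorical, which is what lets discrete information-theoretic estimators apply without discretising the original values.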
    Modelling Cournot Games as Multi-agent Multi-armed Bandits. (arXiv:2201.01182v1 [cs.GT])
    (0 min) We investigate the use of a multi-agent multi-armed bandit (MA-MAB) setting for modeling repeated Cournot oligopoly games, where the firms acting as agents choose from a set of arms representing production quantities (discrete values). Agents interact with separate and independent bandit problems. In this formulation, each agent makes sequential choices among arms to maximize its own reward. Agents have no information about the environment; they can only see their own rewards after taking an action. However, the market demand is a stationary function of total industry output, and random entry or exit from the market is not allowed. Given these assumptions, we find that an $\epsilon$-greedy approach offers a more viable learning mechanism than other traditional MAB approaches, as it does not require any additional knowledge of the system to operate. We also propose two novel approaches that take advantage of the ordered action space: $\epsilon$-greedy+HL and $\epsilon$-greedy+EL. These new approaches help firms focus on more profitable actions by eliminating less profitable choices, and hence are designed to optimize exploration. We use computer simulations to study the emergence of various equilibria in the outcomes and perform an empirical analysis of joint cumulative regrets.
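    For reference, the baseline $\epsilon$-greedy update in this setting is the textbook one; a minimal single-firm sketch with a toy linear inverse-demand curve (our own illustration, not the paper's simulator) is:

    ```python
    import random

    def epsilon_greedy_step(q_values, counts, reward_fn, epsilon=0.1):
        """One epsilon-greedy decision over a discrete set of production quantities."""
        if random.random() < epsilon:
            arm = random.randrange(len(q_values))                       # explore
        else:
            arm = max(range(len(q_values)), key=q_values.__getitem__)   # exploit
        reward = reward_fn(arm)          # the firm observes only its own profit
        counts[arm] += 1
        # Incremental mean update of the action-value estimate.
        q_values[arm] += (reward - q_values[arm]) / counts[arm]
        return arm, reward

    # Toy market: price = 10 - quantity; a single firm for illustration.
    profit = lambda arm: (10 - (arm + 1)) * (arm + 1)  # quantity = arm + 1
    q, c = [0.0] * 9, [0] * 9
    for _ in range(5000):
        epsilon_greedy_step(q, c, profit, epsilon=0.1)
    print(max(range(9), key=q.__getitem__) + 1)  # converges to quantity 5
    ```

    The paper's +HL and +EL variants additionally prune low-profit regions of the ordered action space, concentrating this exploration on the remaining arms.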
    Efficient Quantum Feature Extraction for CNN-based Learning. (arXiv:2201.01246v1 [quant-ph])
    (0 min) Recent work has begun to explore the potential of parametrized quantum circuits (PQCs) as general function approximators. In this work, we propose a quantum-classical deep network structure to enhance the discriminability of classical CNN models. The convolutional layer uses linear filters to scan the input data. In contrast, we build PQCs, which are more potent function approximators with more complex structures, to capture the features within the receptive field. The feature maps are obtained by sliding the PQCs over the input in a similar way as in a CNN. We also give a training algorithm for the proposed model. The hybrid models used in our design are validated by numerical simulation. We demonstrate reasonable classification performance on MNIST and compare the performance with models in different settings. The results show that the model with a highly expressible ansatz achieves lower cost and higher accuracy.
    Learning Operators with Coupled Attention. (arXiv:2201.01032v1 [cs.LG])
    (0 min) Supervised operator learning is an emerging machine learning paradigm with applications to modeling the evolution of spatio-temporal dynamical systems and approximating general black-box relationships between functional data. We propose a novel operator learning method, LOCA (Learning Operators with Coupled Attention), motivated by the recent success of the attention mechanism. In our architecture, the input functions are mapped to a finite set of features which are then averaged with attention weights that depend on the output query locations. By coupling these attention weights together with an integral transform, LOCA is able to explicitly learn correlations in the target output functions, enabling us to approximate nonlinear operators even when the number of output function measurements in the training set is very small. Our formulation is accompanied by rigorous approximation-theoretic guarantees on the universal expressiveness of the proposed model. Empirically, we evaluate the performance of LOCA on several operator learning scenarios involving systems governed by ordinary and partial differential equations, as well as a black-box climate prediction problem. Through these scenarios we demonstrate state-of-the-art accuracy, robustness with respect to noisy input data, and a consistently small spread of errors over testing data sets, even for out-of-distribution prediction tasks.
    Self-directed Machine Learning. (arXiv:2201.01289v1 [cs.LG])
    (0 min) Conventional machine learning (ML) relies heavily on manual design from machine learning experts to decide learning tasks, data, models, optimization algorithms, and evaluation metrics; this is labor-intensive and time-consuming, and such systems cannot learn autonomously like humans. In education science, self-directed learning, where human learners select learning tasks and materials on their own without hands-on guidance, has been shown to be more effective than passive teacher-guided learning. Inspired by the concept of self-directed human learning, we introduce the principal concept of Self-directed Machine Learning (SDML) and propose a framework for SDML. Specifically, we design SDML as a self-directed learning process guided by self-awareness, including internal awareness and external awareness. Our proposed SDML process benefits from self task selection, self data selection, self model selection, self optimization strategy selection and self evaluation metric selection through self-awareness without human guidance. Meanwhile, the learning performance of the SDML process serves as feedback to further improve self-awareness. We propose a mathematical formulation for SDML based on multi-level optimization. Furthermore, we present case studies together with potential applications of SDML, followed by a discussion of future research directions. We expect that SDML could enable machines to conduct human-like self-directed learning and provide a new perspective towards artificial general intelligence.
    Deep learning architectures for nonlinear operator functions and nonlinear inverse problems. (arXiv:1912.11090v3 [math.OC] UPDATED)
    (0 min) We develop a theoretical analysis for special neural network architectures, termed operator recurrent neural networks, for approximating nonlinear functions whose inputs are linear operators. Such functions commonly arise in solution algorithms for inverse boundary value problems. Traditional neural networks treat input data as vectors, and thus they do not effectively capture the multiplicative structure associated with the linear operators that correspond to the data in such inverse problems. We therefore introduce a new family that resembles a standard neural network architecture, but where the input data acts multiplicatively on vectors. Motivated by compact operators appearing in boundary control and the analysis of inverse boundary value problems for the wave equation, we promote structure and sparsity in selected weight matrices in the network. After describing this architecture, we study its representation properties as well as its approximation properties. We furthermore show that an explicit regularization can be introduced that can be derived from the mathematical analysis of the mentioned inverse problems, and which leads to certain guarantees on the generalization properties. We observe that the sparsity of the weight matrices improves the generalization estimates. Lastly, we discuss how operator recurrent networks can be viewed as a deep learning analogue to deterministic algorithms such as boundary control for reconstructing the unknown wavespeed in the acoustic wave equation from boundary measurements.
    McXai: Local model-agnostic explanation as two games. (arXiv:2201.01044v1 [cs.LG])
    (0 min) To this day, a variety of approaches for providing local interpretability of black-box machine learning models have been introduced. Unfortunately, all of these methods suffer from one or more of the following deficiencies: they are either difficult to understand themselves, they work on a per-feature basis and ignore the dependencies between features, and/or they only focus on those features asserting the decision made by the model. To address these points, this work introduces a reinforcement-learning-based approach called Monte Carlo tree search for eXplainable Artificial Intelligence (McXai) to explain the decisions of any black-box classification model (classifier). Our method leverages Monte Carlo tree search and models the process of generating explanations as two games. In one game, the reward is maximized by finding feature sets that support the decision of the classifier, while in the second game, finding feature sets leading to alternative decisions maximizes the reward. The result is a human-friendly representation as a tree structure, in which each node represents a set of features to be studied, with smaller explanations at the top of the tree. Our experiments show that the features found by our method are more informative with respect to classifications than those found by classical approaches like LIME and SHAP. Furthermore, by also identifying misleading features, our approach is able to guide towards improved robustness of the black-box model in many situations.
    Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images. (arXiv:2201.01266v1 [eess.IV])
    (0 min) Semantic segmentation of brain tumors is a fundamental medical image analysis task involving multiple MRI imaging modalities that can assist clinicians in diagnosing the patient and successively studying the progression of the malignant entity. In recent years, Fully Convolutional Neural Network (FCNN) approaches have become the de facto standard for 3D medical image segmentation. The popular "U-shaped" network architecture has achieved state-of-the-art performance benchmarks on different 2D and 3D semantic segmentation tasks and across various imaging modalities. However, due to the limited kernel size of convolution layers in FCNNs, their performance in modeling long-range information is sub-optimal, and this can lead to deficiencies in the segmentation of tumors of variable size. On the other hand, transformer models have demonstrated excellent capabilities in capturing such long-range information in multiple domains, including natural language processing and computer vision. Inspired by the success of vision transformers and their variants, we propose a novel segmentation model termed Swin UNEt TRansformers (Swin UNETR). Specifically, the task of 3D brain tumor semantic segmentation is reformulated as a sequence-to-sequence prediction problem wherein multi-modal input data is projected into a 1D sequence of embeddings and used as input to a hierarchical Swin transformer encoder. The Swin transformer encoder extracts features at five different resolutions by utilizing shifted windows for computing self-attention, and is connected to an FCNN-based decoder at each resolution via skip connections. We participated in the BraTS 2021 segmentation challenge, and our proposed model ranks among the top-performing approaches in the validation phase. Code: https://monai.io/research/swin-unetr
    Parity-based Cumulative Fairness-aware Boosting. (arXiv:2201.01148v1 [cs.LG])
    (0 min) Data-driven AI systems can lead to discrimination on the basis of protected attributes like gender or race. One reason for this behavior is the encoded societal biases in the training data (e.g., females are underrepresented), which are aggravated in the presence of unbalanced class distributions (e.g., "granted" is the minority class). State-of-the-art fairness-aware machine learning approaches focus on preserving the \emph{overall} classification accuracy while improving fairness. In the presence of class imbalance, such methods may further aggravate the problem of discrimination by denying an already underrepresented group (e.g., \textit{females}) fundamental rights of equal social privileges (e.g., equal credit opportunity). To this end, we propose AdaFair, a fairness-aware boosting ensemble that changes the data distribution at each round, taking into account not only the class errors but also the fairness-related performance of the model, defined cumulatively based on the partial ensemble. In addition to boosting the discriminated group during training at each round, AdaFair directly tackles imbalance in a post-training phase by optimizing the number of ensemble learners for balanced error rate (BER). AdaFair can facilitate different parity-based fairness notions and effectively mitigate discriminatory outcomes. Our experiments show that our approach can achieve parity in terms of statistical parity, equal opportunity, and disparate mistreatment while maintaining good predictive performance for all classes.
    Neural Mean Discrepancy for Efficient Out-of-Distribution Detection. (arXiv:2104.11408v3 [cs.LG] UPDATED)
    (0 min) Various approaches have been proposed for out-of-distribution (OOD) detection by augmenting models, input examples, training sets, and optimization objectives. Deviating from existing work, we have a simple hypothesis that standard off-the-shelf models may already contain sufficient information about the training set distribution which can be leveraged for reliable OOD detection. Our empirical study validating this hypothesis, which measures the model's activation means over OOD and in-distribution (ID) mini-batches, surprisingly finds that activation means of OOD mini-batches consistently deviate more from those of the training data. In addition, training data activation means can be computed offline efficiently or retrieved from batch normalization layers as a `free lunch'. Based upon this observation, we propose a novel metric called Neural Mean Discrepancy (NMD), which compares the neural means of the input examples and the training data. Leveraging the simplicity of NMD, we propose an efficient OOD detector that computes neural means via a standard forward pass followed by a lightweight classifier. Extensive experiments show that NMD outperforms state-of-the-art OOD approaches across multiple datasets and model architectures in terms of both detection accuracy and computational cost.
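    A hedged sketch of the core measurement, collecting per-channel activation means with forward hooks and scoring a batch by its distance to training-set means, might look as follows. The hook placement, distance, and names are our illustration; the paper's detector additionally feeds these statistics to a lightweight classifier.

    ```python
    import torch
    import torch.nn as nn

    def activation_means(model, x, layers):
        """Collect per-channel mean activations of the given layers for a batch."""
        means, handles = [], []
        for layer in layers:
            handles.append(layer.register_forward_hook(
                lambda mod, inp, out: means.append(out.mean(dim=(0, 2, 3)))))
        with torch.no_grad():
            model(x)
        for h in handles:
            h.remove()
        return torch.cat(means)

    # Toy convnet; in the paper's setting, training-set means can alternatively
    # be read off BatchNorm running statistics instead of being recomputed.
    model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Conv2d(8, 16, 3))
    layers = [model[0], model[2]]

    train_means = activation_means(model, torch.randn(64, 3, 32, 32), layers)
    test_batch = torch.randn(16, 3, 32, 32) + 1.0  # a "shifted" batch
    nmd = (activation_means(model, test_batch, layers) - train_means).abs().mean()
    print(nmd)  # larger for batches that deviate from the training distribution
    ```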
    Profile Guided Optimization without Profiles: A Machine Learning Approach. (arXiv:2112.14679v2 [cs.PL] UPDATED)
    (0 min) Profile guided optimization is an effective technique for improving the optimization ability of compilers based on dynamic behavior, but collecting profile data is expensive, cumbersome, and requires regular updating to remain fresh. We present a novel statistical approach to inferring branch probabilities that improves the performance of programs compiled without profile guided optimizations. We perform offline training using information collected from a large corpus of binaries that carry branch probability information. The learned model is used by the compiler to predict the branch probabilities of regular uninstrumented programs, which the compiler can then use to inform optimization decisions. We integrate our technique directly into LLVM, supplementing the existing human-engineered compiler heuristics. We evaluate our technique on a suite of benchmarks, demonstrating some gains over compiling without profile information. In deployment, our technique requires no profiling runs and has a negligible effect on compilation time.
    Modeling Users' Behavior Sequences with Hierarchical Explainable Network for Cross-domain Fraud Detection. (arXiv:2201.01004v1 [cs.LG])
    (0 min) With the explosive growth of the e-commerce industry, detecting online transaction fraud in real-world applications has become increasingly important to the development of e-commerce platforms. The sequential behavior history of users provides useful information for differentiating fraudulent payments from regular ones. Recently, some approaches have been proposed to solve this sequence-based fraud detection problem. However, these methods usually suffer from two problems: the prediction results are difficult to explain, and the exploitation of the internal information of behaviors is insufficient. To tackle these two problems, we propose a Hierarchical Explainable Network (HEN) to model users' behavior sequences, which not only improves the performance of fraud detection but also makes the inference process interpretable. Meanwhile, as e-commerce business expands to new domains, e.g., new countries or new markets, a major problem for modeling user behavior in fraud detection systems is the limitation of data collection, e.g., very little data or few labels available. Thus, in this paper, we further propose a transfer framework to tackle the cross-domain fraud detection problem, which aims to transfer knowledge from existing domains (source domains) with sufficient, mature data to improve performance in a new domain (target domain). Our proposed method is a general transfer framework that can be applied not only to HEN but also to various existing models in the Embedding & MLP paradigm. Based on 90 transfer-task experiments, we demonstrate that our transfer framework not only contributes to the cross-domain fraud detection task with HEN, but is also universal and expandable for various existing models.
    Multivariate Time Series Regression with Graph Neural Networks. (arXiv:2201.00818v1 [cs.LG])
    (0 min) Machine learning, with its advances in deep learning, has shown great potential in analysing time series. However, in many scenarios, additional information is available that can potentially improve predictions if incorporated into the learning methods. This is crucial for data that arise from, e.g., sensor networks that contain information about sensor locations. Such spatial information can be exploited by modeling it via graph structures, along with the sequential (time) information. Recent advances in adapting deep learning to graphs have shown promising potential in various graph-related tasks. However, these methods have not been adapted for time series tasks to a great extent. Specifically, most attempts have essentially consolidated around spatial-temporal graph neural networks for time series forecasting with small sequence lengths. Generally, these architectures are not suited for regression or classification tasks that contain large sequences of data. Therefore, in this work, we propose an architecture capable of processing these long sequences in a multivariate time series regression task, using the benefits of Graph Neural Networks to improve predictions. Our model is tested on two seismic datasets that contain earthquake waveforms, where the goal is to predict intensity measurements of ground shaking at a set of stations. Our findings demonstrate promising results of our approach, which are discussed in depth with an additional ablation study.
    Value Functions Factorization with Latent State Information Sharing in Decentralized Multi-Agent Policy Gradients. (arXiv:2201.01247v1 [cs.MA])
    (0 min) Value function factorization via centralized training and decentralized execution is promising for solving cooperative multi-agent reinforcement learning tasks. One of the approaches in this area, QMIX, has become state-of-the-art and achieved the best performance on the StarCraft II micromanagement benchmark. However, the monotonic mixing of per-agent estimates in QMIX is known to restrict the joint-action Q-values it can represent, and single-agent value function estimation suffers from insufficient global state information, often resulting in suboptimality. To this end, we present LSF-SAC, a novel framework that features a variational-inference-based information-sharing mechanism as extra state information to assist individual agents in value function factorization. We demonstrate that such latent individual state information sharing can significantly expand the power of value function factorization, while fully decentralized execution can still be maintained in LSF-SAC through a soft actor-critic design. We evaluate LSF-SAC on the StarCraft II micromanagement challenge and demonstrate that it outperforms several state-of-the-art methods in challenging collaborative tasks. We further conduct extensive ablation studies to locate the key factors accounting for its performance improvements. We believe that this new insight can lead to new local value estimation methods and variational deep learning algorithms. A demo video and implementation code can be found at https://sites.google.com/view/sacmm.
    Automated Graph Machine Learning: Approaches, Libraries and Directions. (arXiv:2201.01288v1 [cs.LG])
    (0 min) Graph machine learning has been extensively studied in both academia and industry. However, as the literature on graph learning booms with a vast number of emerging methods and techniques, it becomes increasingly difficult to manually design the optimal machine learning algorithm for different graph-related tasks. To tackle this challenge, automated graph machine learning, which aims at discovering the best hyper-parameter and neural architecture configurations for different graph tasks/data without manual design, is gaining increasing attention from the research community. In this paper, we extensively discuss automated graph machine learning approaches, covering hyper-parameter optimization (HPO) and neural architecture search (NAS) for graph machine learning. We briefly overview existing libraries designed for either graph machine learning or automated machine learning, and then introduce in depth AutoGL, our dedicated and the world's first open-source library for automated graph machine learning. Last but not least, we share our insights on future research directions for automated graph machine learning. This paper is the first systematic and comprehensive discussion of approaches, libraries, and directions for automated graph machine learning.
    Efficient Multi-Organ Segmentation Using SpatialConfiguration-Net with Low GPU Memory Requirements. (arXiv:2111.13630v2 [eess.IV] UPDATED)
    (0 min) Even though many semantic segmentation methods exist that perform well on many medical datasets, they are often not designed for direct use in clinical practice. The two main concerns are generalization to unseen data with a different visual appearance, e.g., images acquired using a different scanner, and efficiency in terms of computation time and required Graphics Processing Unit (GPU) memory. In this work, we employ a multi-organ segmentation model based on the SpatialConfiguration-Net (SCN), which integrates prior knowledge of the spatial configuration among the labelled organs to resolve spurious responses in the network outputs. Furthermore, we modified the architecture of the segmentation model to reduce its memory footprint as much as possible without drastically impacting the quality of the predictions. Lastly, we implemented a minimal inference script, for which we optimized both execution time and required GPU memory.
    Resilience Aspects in Distributed Wireless Electroencephalographic Sampling. (arXiv:2201.01272v1 [eess.SP])
    (0 min) Resilience aspects of remote electroencephalography (EEG) sampling are considered. We demonstrate the possibility of using motion sensor data and measurements of industrial power-grid interference to detect failed sampling channels. We find no significant correlation between the signals of failed channels and motion sensor data, while the level of the 50 Hz spectral component in failed channels differs significantly from that of normally operating channels. Conclusions are drawn about applying these results to increase the resilience of electroencephalography sampling.
    Compensation Learning. (arXiv:2107.11921v4 [cs.LG] UPDATED)
    (0 min) Weighting strategies prevail in machine learning. For example, a common approach in robust machine learning is to exert lower weights on samples which are likely to be noisy or quite hard. This study reveals another undiscovered strategy, namely, compensating. Various incarnations of compensating have been utilized, but the strategy has not been explicitly identified. Learning with compensating is called compensation learning, and a systematic taxonomy is constructed for it in this study. In our taxonomy, compensation learning is divided on the basis of the compensation targets, directions, inference manners, and granularity levels. Many existing learning algorithms, including some classical ones, can be viewed or understood at least partially as compensation techniques. Furthermore, a family of new learning algorithms can be obtained by plugging compensation learning into existing learning algorithms. Specifically, two concrete new learning algorithms are proposed for robust machine learning. Extensive experiments on image classification and text sentiment analysis verify the effectiveness of the two new algorithms. Compensation learning can also be used in various other learning scenarios, such as imbalanced learning, clustering, regression, and so on.
    Common Misconceptions about Population Data. (arXiv:2112.10912v2 [cs.DB] UPDATED)
    (0 min) Databases covering all individuals of a population are increasingly used for research studies in domains ranging from public health to the social sciences. There is also growing interest from governments and businesses in using population data to support data-driven decision making. The massive size of such databases is often mistaken as a guarantee for valid inferences on the population of interest. However, population data have characteristics that make them challenging to use, including various assumptions about how such data were collected and what types of processing have been applied to them. Furthermore, the full potential of population data can often only be unlocked when such data are linked to other databases, a process that adds fresh challenges. This article discusses a diverse range of misconceptions about population data that we believe anybody who works with such data needs to be aware of. Many of these misconceptions are not well documented in scientific publications but only discussed anecdotally among researchers and practitioners. We conclude with a set of recommendations for inference when using population data.
    Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition. (arXiv:2112.05820v3 [cs.CL] UPDATED)
    (0 min) The sparsely-gated Mixture of Experts (MoE) can greatly increase network capacity at little additional computational cost. In this work, we investigate how multi-lingual Automatic Speech Recognition (ASR) networks can be scaled up with a simple routing algorithm in order to achieve better accuracy. More specifically, we apply the sparsely-gated MoE technique to two types of networks: Sequence-to-Sequence Transformer (S2S-T) and Transformer Transducer (T-T). We demonstrate through a set of ASR experiments on multiple language data that MoE networks can reduce relative word error rates by 16.3% and 4.6% with the S2S-T and T-T, respectively. Moreover, we thoroughly investigate the effect of MoE on the T-T architecture in various conditions: streaming mode, non-streaming mode, the use of language ID, and the label decoder with MoE.
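    The reason sparse gating adds capacity cheaply is that only the top-k experts run per token. A minimal sketch of such a layer, showing generic top-2 routing with our own naming rather than the paper's exact S2S-T/T-T integration, is:

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseMoE(nn.Module):
        """Sparsely-gated MoE layer: route each token to its top-k experts."""
        def __init__(self, d_model, n_experts=8, k=2):
            super().__init__()
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts))
            self.gate = nn.Linear(d_model, n_experts)
            self.k = k

        def forward(self, x):                      # x: (tokens, d_model)
            scores = self.gate(x)                  # (tokens, n_experts)
            topk, idx = scores.topk(self.k, dim=-1)
            weights = F.softmax(topk, dim=-1)      # renormalize over chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.k):             # only k experts run per token
                for e in range(len(self.experts)):
                    mask = idx[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
            return out

    moe = SparseMoE(d_model=64)
    y = moe(torch.randn(10, 64))  # (10, 64): same shape, routed through 2 of 8 experts
    ```

    Capacity grows with the number of experts while per-token compute stays roughly constant, which is the trade-off the abstract exploits for multi-lingual ASR.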
    Robust Semi-supervised Federated Learning for Images Automatic Recognition in Internet of Drones. (arXiv:2201.01230v1 [cs.LG])
    (0 min) Air access networks have been recognized as a significant driver of various Internet of Things (IoT) services and applications. In particular, the aerial computing network infrastructure centered on the Internet of Drones has set off a new revolution in automatic image recognition. This emerging technology relies on sharing ground-truth labeled data between Unmanned Aerial Vehicle (UAV) swarms to train a high-quality automatic image recognition model. However, such an approach brings data privacy and data availability challenges. To address these issues, we first present a Semi-supervised Federated Learning (SSFL) framework for privacy-preserving UAV image recognition. Specifically, we propose a model-parameter mixing strategy to improve the naive combination of FL and semi-supervised learning methods under two realistic scenarios (labels-at-client and labels-at-server), referred to as Federated Mixing (FedMix). Furthermore, there are significant differences in the number, features, and distribution of the local data collected by UAVs using different camera modules in different environments, i.e., statistical heterogeneity. To alleviate this problem, we propose an aggregation rule based on the frequency of each client's participation in training, namely the FedFreq aggregation rule, which adjusts the weight of the corresponding local model according to its frequency. Numerical results demonstrate that the performance of our proposed method is significantly better than that of the current baselines and is robust to different non-IID levels of client data.
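    One plausible reading of frequency-weighted aggregation (our interpretation of the rule named above; the paper's exact weighting may differ) is a FedAvg-style average whose coefficients are normalized participation counts:

    ```python
    import numpy as np

    def fedfreq_aggregate(local_weights, participation_counts):
        """Weight each client's model by its normalized training-participation frequency."""
        freqs = np.asarray(participation_counts, dtype=float)
        alphas = freqs / freqs.sum()  # aggregation coefficients, sum to 1
        agg = {}
        for key in local_weights[0]:
            agg[key] = sum(a * w[key] for a, w in zip(alphas, local_weights))
        return agg

    # Three clients with toy single-tensor models and participation counts 5, 2, 1.
    clients = [{"w": np.full(3, v)} for v in (1.0, 2.0, 3.0)]
    print(fedfreq_aggregate(clients, [5, 2, 1]))  # {'w': array([1.5, 1.5, 1.5])}
    ```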
    Two-level Graph Neural Network. (arXiv:2201.01190v1 [cs.LG])
    (0 min) Graph Neural Networks (GNNs) are recently proposed neural network structures for processing graph-structured data. Due to their neighbor-aggregation strategy, existing GNNs focus on capturing node-level information and neglect high-level information. Existing GNNs therefore suffer from representational limitations caused by the Local Permutation Invariance (LPI) problem. To overcome these limitations and enrich the features captured by GNNs, we propose a novel GNN framework, referred to as the Two-level GNN (TL-GNN), which merges subgraph-level information with node-level information. Moreover, we provide a mathematical analysis of the LPI problem, demonstrating that subgraph-level information is beneficial in overcoming the problems associated with LPI. A subgraph-counting method based on dynamic programming is also proposed, with time complexity O(n^3), where n is the number of nodes in the graph. Experiments show that TL-GNN outperforms existing GNNs and achieves state-of-the-art performance.
    Combat Data Shift in Few-shot Learning with Knowledge Graph. (arXiv:2101.11354v3 [cs.LG] UPDATED)
    (0 min) Many few-shot learning approaches have been designed under the meta-learning framework, which learns from a variety of learning tasks and generalizes to new tasks. These meta-learning approaches achieve the expected performance in the scenario where all samples are drawn from the same distribution (i.i.d. observations). However, in real-world applications, the few-shot learning paradigm often suffers from data shift, i.e., samples in different tasks, even in the same task, can be drawn from different data distributions. Most existing few-shot learning approaches are not designed with data shift in mind, and thus show degraded performance when data distributions shift. However, it is non-trivial to address the data shift problem in few-shot learning, due to the limited number of labeled samples in each task. To address this problem, we propose a novel metric-based meta-learning framework that extracts task-specific representations and task-shared representations with the help of a knowledge graph. The data shift within and between tasks can thus be combated by the combination of task-shared and task-specific representations. The proposed model is evaluated on popular benchmarks and two newly constructed challenging datasets. The evaluation results demonstrate its remarkable performance.
    Exploring Architectural Ingredients of Adversarially Robust Deep Neural Networks. (arXiv:2110.03825v4 [cs.LG] UPDATED)
    (0 min) Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks. A range of defense methods have been proposed to train adversarially robust DNNs, among which adversarial training has demonstrated promising results. However, despite preliminary understandings developed for adversarial training, it is still not clear, from an architectural perspective, what configurations lead to more robust DNNs. In this paper, we address this gap via a comprehensive investigation of the impact of network width and depth on the robustness of adversarially trained DNNs. Specifically, we make the following key observations: 1) more parameters (higher model capacity) do not necessarily help adversarial robustness; 2) reducing capacity at the last stage (the last group of blocks) of the network can actually improve adversarial robustness; and 3) under the same parameter budget, there exists an optimal architectural configuration for adversarial robustness. We also provide a theoretical analysis explaining why such network configurations can help robustness. These architectural insights can help design adversarially robust DNNs. Code is available at \url{https://github.com/HanxunH/RobustWRN}.
    PluGeN: Multi-Label Conditional Generation From Pre-Trained Models. (arXiv:2109.09011v2 [cs.LG] UPDATED)
    (2 min) Modern generative models achieve excellent quality in a variety of tasks, including image or text generation and chemical molecule modeling. However, existing methods often lack the essential ability to generate examples with requested properties, such as the age of the person in a photo or the weight of a generated molecule. Incorporating such additional conditioning factors would require rebuilding the entire architecture and optimizing the parameters from scratch. Moreover, it is difficult to disentangle selected attributes so as to edit only one attribute while leaving the others unchanged. To overcome these limitations we propose PluGeN (Plugin Generative Network), a simple yet effective generative technique that can be used as a plugin for pre-trained generative models. The idea behind our approach is to transform the entangled latent representation using a flow-based module into a multi-dimensional space where the values of each attribute are modeled as an independent one-dimensional distribution. In consequence, PluGeN can generate new samples with desired attributes as well as manipulate labeled attributes of existing examples. Due to the disentangling of the latent representation, we are even able to generate samples with rare or unseen combinations of attributes in the dataset, such as a young person with gray hair, men with make-up, or women with beards. We combined PluGeN with GAN and VAE models and applied it to conditional generation and manipulation of images and chemical molecule modeling. Experiments demonstrate that PluGeN preserves the quality of backbone models while adding the ability to control the values of labeled attributes.
    Accelerated Zeroth-Order and First-Order Momentum Methods from Mini to Minimax Optimization. (arXiv:2008.08170v5 [math.OC] UPDATED)
    (0 min) In this paper, we propose a class of accelerated zeroth-order and first-order momentum methods for both nonconvex mini-optimization and minimax optimization. Specifically, we propose a new accelerated zeroth-order momentum (Acc-ZOM) method for black-box mini-optimization. Moreover, we prove that our Acc-ZOM method achieves a lower query complexity of $\tilde{O}(d^{3/4}\epsilon^{-3})$ for finding an $\epsilon$-stationary point, which improves the best known result by a factor of $O(d^{1/4})$, where $d$ denotes the variable dimension. In particular, Acc-ZOM does not require the large batches needed by existing zeroth-order stochastic algorithms. Meanwhile, we propose an accelerated \textbf{zeroth-order} momentum descent ascent (Acc-ZOMDA) method for \textbf{black-box} minimax optimization, which obtains a query complexity of $\tilde{O}((d_1+d_2)^{3/4}\kappa_y^{4.5}\epsilon^{-3})$ without large batches for finding an $\epsilon$-stationary point, where $d_1$ and $d_2$ denote the variable dimensions and $\kappa_y$ is the condition number. Moreover, we propose an accelerated \textbf{first-order} momentum descent ascent (Acc-MDA) method for \textbf{white-box} minimax optimization, which has a gradient complexity of $\tilde{O}(\kappa_y^{4.5}\epsilon^{-3})$ without large batches for finding an $\epsilon$-stationary point. In particular, our Acc-MDA can obtain a lower gradient complexity of $\tilde{O}(\kappa_y^{2.5}\epsilon^{-3})$ with a batch size of $O(\kappa_y^4)$. Extensive experimental results on black-box adversarial attacks against deep neural networks (DNNs) and poisoning attacks demonstrate the efficiency of our algorithms.
    Learning to Generate Novel Classes for Deep Metric Learning. (arXiv:2201.01008v1 [cs.CV])
    (0 min) Deep metric learning aims to learn an embedding space where the distance between data points reflects their class equivalence, even when their classes are unseen during training. However, the limited number of classes available in training precludes generalization of the learned embedding space. Motivated by this, we introduce a new data augmentation approach that synthesizes novel classes and their embedding vectors. Our approach can provide rich semantic information to an embedding model and improve its generalization by augmenting the training data with novel classes unavailable in the original data. We implement this idea by learning and exploiting a conditional generative model, which, given a class label and a noise vector, produces a random embedding vector of that class. Our proposed generator allows the loss to use richer class relations by augmenting realistic and diverse classes, resulting in better generalization to unseen samples. Experimental results on public benchmark datasets demonstrate that our method clearly enhances the performance of proxy-based losses.
    Semi-relaxed Gromov-Wasserstein divergence with applications on graphs. (arXiv:2110.02753v2 [cs.LG] UPDATED)
    (0 min) Comparing structured objects such as graphs is a fundamental operation involved in many learning tasks. To this end, the Gromov-Wasserstein (GW) distance, based on Optimal Transport (OT), has proven successful in handling the specific nature of the associated objects. More specifically, through the nodes' connectivity relations, GW operates on graphs, seen as probability measures over specific spaces. At the core of OT is the idea of conservation of mass, which imposes a coupling between all the nodes of the two graphs being compared. We argue in this paper that this property can be detrimental for tasks such as graph dictionary or partition learning, and we relax it by proposing a new semi-relaxed Gromov-Wasserstein divergence. Aside from immediate computational benefits, we discuss its properties, and show that it can lead to an efficient graph dictionary learning algorithm. We empirically demonstrate its relevance for complex tasks on graphs such as partitioning, clustering and completion.
    Be Confident! Towards Trustworthy Graph Neural Networks via Confidence Calibration. (arXiv:2109.14285v3 [cs.LG] UPDATED)
    (0 min) Although Graph Neural Networks (GNNs) have achieved remarkable accuracy, whether their results are trustworthy is still unexplored. Previous studies suggest that many modern neural networks are over-confident in their predictions; surprisingly, however, we discover that GNNs are primarily the opposite, i.e., GNNs are under-confident. Confidence calibration for GNNs is therefore highly desirable. In this paper, we propose a novel trustworthy GNN model obtained by designing a topology-aware post-hoc calibration function. Specifically, we first verify that the confidence distribution in a graph has a homophily property, and this finding inspires us to design a calibration GNN model (CaGCN) to learn the calibration function. CaGCN is able to obtain a unique transformation from the logits of GNNs to the calibrated confidence for each node; meanwhile, this transformation preserves the order between classes, satisfying the accuracy-preserving property. Moreover, we apply the calibration GNN to a self-training framework, showing that more trustworthy pseudo labels can be obtained with the calibrated confidence, further improving performance. Extensive experiments demonstrate the effectiveness of our proposed model in terms of both calibration and accuracy.
    Learning to Play No-Press Diplomacy with Best Response Policy Iteration. (arXiv:2006.04635v4 [cs.LG] UPDATED)
    (2 min) Recent advances in deep reinforcement learning (RL) have led to considerable progress in many 2-player zero-sum games, such as Go, Poker and StarCraft. The purely adversarial nature of such games allows for a conceptually simple and principled application of RL methods. However, real-world settings are many-agent, and agent interactions are complex mixtures of common-interest and competitive aspects. We consider Diplomacy, a 7-player board game designed to accentuate dilemmas resulting from many-agent interactions. It also features a large combinatorial action space and simultaneous moves, which are challenging for RL algorithms. We propose a simple yet effective approximate best response operator, designed to handle large combinatorial action spaces and simultaneous moves. We also introduce a family of policy iteration methods that approximate fictitious play. With these methods, we successfully apply RL to Diplomacy: we show that our agents convincingly outperform the previous state-of-the-art, and game-theoretic equilibrium analysis shows that the new process yields consistent improvements.
    A Survey On Universal Adversarial Attack. (arXiv:2103.01498v2 [cs.LG] UPDATED)
    (2 min) The intriguing phenomenon of adversarial examples has attracted significant attention in machine learning, and what might be more surprising to the community is the existence of universal adversarial perturbations (UAPs), i.e., a single perturbation that fools the target DNN for most images. With a focus on UAPs against deep classifiers, this survey summarizes recent progress on universal adversarial attacks, discussing the challenges from both the attack and defense sides, as well as the reasons for the existence of UAPs. We aim to extend this work as a dynamic survey that will regularly update its content to follow new works on UAPs or universal attacks in a wide range of domains, such as image, audio, video, text, etc. Relevant updates will be discussed at: https://bit.ly/2SbQlLG. We welcome authors of future works in this field to contact us about including your new findings.
    On the effectiveness of adversarial training against common corruptions. (arXiv:2103.02325v2 [cs.LG] UPDATED)
    (2 min) The literature on robustness towards common corruptions shows no consensus on whether adversarial training can improve performance in this setting. First, we show that, when used with an appropriately selected perturbation radius, $\ell_p$ adversarial training can serve as a strong baseline against common corruptions, improving both accuracy and calibration. Then we explain why adversarial training performs better than data augmentation with simple Gaussian noise, which has been observed to be a meaningful baseline on common corruptions. Related to this, we identify the $\sigma$-overfitting phenomenon, where Gaussian augmentation overfits to the particular standard deviation used for training, which has a significant detrimental effect on common corruption accuracy. We discuss how to alleviate this problem, and then how to further enhance $\ell_p$ adversarial training by introducing an efficient relaxation of adversarial training with learned perceptual image patch similarity as the distance metric. Through experiments on CIFAR-10 and ImageNet-100, we show that our approach not only improves the $\ell_p$ adversarial training baseline but also has cumulative gains with data augmentation methods such as AugMix, DeepAugment, ANT, and SIN, leading to state-of-the-art performance on common corruptions. The code of our experiments is publicly available at https://github.com/tml-epfl/adv-training-corruptions.
    3DVSR: 3D EPI Volume-based Approach for Angular and Spatial Light field Image Super-resolution. (arXiv:2201.01294v1 [cs.CV])
    (2 min) Light field (LF) imaging, which captures both spatial and angular information of a scene, is undoubtedly beneficial to numerous applications. Although various techniques have been proposed for LF acquisition, achieving both angularly and spatially high-resolution LF remains a technological challenge. In this paper, a learning-based approach applied to 3D epipolar images (EPIs) is proposed to reconstruct high-resolution LF. Through a 2-stage super-resolution framework, the proposed approach effectively addresses various LF super-resolution (SR) problems, i.e., spatial SR, angular SR, and angular-spatial SR. While the first stage provides flexible options to up-sample the EPI volume to the desired resolution, the second stage, which consists of a novel EPI volume-based refinement network (EVRN), substantially enhances the quality of the high-resolution EPI volume. An extensive evaluation on 90 challenging synthetic and real-world light field scenes from 7 published datasets shows that the proposed approach outperforms state-of-the-art methods to a large extent for both spatial and angular super-resolution problems, i.e., an average peak signal-to-noise ratio improvement of more than 2.0 dB, 1.4 dB, and 3.14 dB in spatial SR $\times 2$, spatial SR $\times 4$, and angular SR respectively. The reconstructed 4D light field demonstrates a balanced performance distribution across all perspective images and superior visual quality compared to previous works.
    Predicting Influenza A Viral Host Using PSSM and Word Embeddings. (arXiv:2201.01140v1 [cs.CL])
    (2 min) The rapid mutation of the influenza virus threatens public health. Reassortment among viruses with different hosts can lead to a fatal pandemic. However, it is difficult to detect the original host of the virus during or after an outbreak, as influenza viruses can circulate between different species. Therefore, early and rapid detection of the viral host would help reduce the further spread of the virus. We use various machine learning models with features derived from the position-specific scoring matrix (PSSM) and features learned from word embedding and word encoding to infer the origin host of viruses. The results show that the PSSM-based model reaches an MCC of around 95% and an F1 score of around 96%, while the model with word embedding obtains an MCC of around 96% and an F1 score of around 97%.
    Visual Microfossil Identification via Deep Metric Learning. (arXiv:2112.09490v2 [cs.CV] UPDATED)
    (0 min) We apply deep metric learning for the first time to the problem of classifying planktic foraminifer shells on microscopic images. This species recognition task is an important information source and scientific pillar for reconstructing past climates. All foraminifer CNN recognition pipelines in the literature produce black-box classifiers that lack visualisation options for human experts and cannot be applied to open set problems. Here, we benchmark metric learning against these pipelines, produce the first scientific visualisation of the phenotypic planktic foraminifer morphology space, and demonstrate that metric learning can be used to cluster species unseen during training. We show that metric learning outperforms all published CNN-based state-of-the-art benchmarks in this domain. We evaluate our approach on the 34,640 expert-annotated images of the Endless Forams public library of 35 modern planktic foraminifera species. Our results on this data show leading 92% accuracy (at 0.84 F1-score) in reproducing expert labels on withheld test data, and 66.5% accuracy (at 0.70 F1-score) when clustering species never encountered in training. We conclude that metric learning is highly effective for this domain and serves as an important tool towards expert-in-the-loop automation of microfossil identification. Key code, network weights, and data splits are published with this paper for full reproducibility.
    ExAID: A Multimodal Explanation Framework for Computer-Aided Diagnosis of Skin Lesions. (arXiv:2201.01249v1 [cs.AI])
    (2 min) One principal impediment to the successful deployment of AI-based Computer-Aided Diagnosis (CAD) systems in clinical workflows is their lack of transparent decision making. Although commonly used eXplainable AI methods provide some insight into opaque algorithms, such explanations are usually convoluted and not readily comprehensible except by highly trained experts. The explanation of decisions regarding the malignancy of skin lesions from dermoscopic images demands particular clarity, as the underlying medical problem definition is itself ambiguous. This work presents ExAID (Explainable AI for Dermatology), a novel framework for biomedical image analysis providing multi-modal concept-based explanations, consisting of easy-to-understand textual explanations supplemented by visual maps justifying the predictions. ExAID relies on Concept Activation Vectors to map human concepts to those learnt by arbitrary Deep Learning models in latent space, and Concept Localization Maps to highlight concepts in the input space. This identification of relevant concepts is then used to construct fine-grained textual explanations supplemented by concept-wise location information to provide comprehensive and coherent multi-modal explanations. All information is comprehensively presented in a diagnostic interface for use in clinical routines. An educational mode provides dataset-level explanation statistics and tools for data and model exploration to aid medical research and education. Through rigorous quantitative and qualitative evaluation of ExAID, we show the utility of multi-modal explanations for CAD-assisted scenarios even in the case of wrong predictions. We believe that ExAID will provide dermatologists an effective screening tool that they both understand and trust. Moreover, it will be the basis for similar applications in other biomedical imaging fields.
    Towards a Rigorous Evaluation of Time-series Anomaly Detection. (arXiv:2109.05257v2 [cs.LG] UPDATED)
    (2 min) In recent years, proposed studies on time-series anomaly detection (TAD) report high F1 scores on benchmark TAD datasets, giving the impression of clear improvements in TAD. However, most studies apply a peculiar evaluation protocol called point adjustment (PA) before scoring. In this paper, we theoretically and experimentally reveal that the PA protocol has a great possibility of overestimating detection performance; that is, even a random anomaly score can easily turn into a state-of-the-art TAD method. Therefore, comparing TAD methods after applying the PA protocol can lead to misguided rankings. Furthermore, we question the potential of existing TAD methods by showing that an untrained model obtains comparable detection performance to existing methods even when PA is forbidden. Based on our findings, we propose a new baseline and an evaluation protocol. We expect that our study will help enable rigorous evaluation of TAD and lead to further improvements in future research.
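    To see why PA inflates scores, it helps to state the protocol in code: a single flagged point inside a ground-truth anomaly segment counts the whole segment as detected. A minimal sketch (our own illustration of the protocol as commonly described) is:

    ```python
    import numpy as np

    def point_adjust(pred, label):
        """PA protocol: if any point in a true anomaly segment is flagged,
        mark the whole segment as detected before computing F1."""
        pred = pred.copy()
        lab = np.asarray(label).astype(np.int8)
        # Indices where ground-truth anomaly segments start and end.
        boundaries = np.flatnonzero(np.diff(np.r_[0, lab, 0]))
        for start, end in zip(boundaries[::2], boundaries[1::2]):
            if pred[start:end].any():
                pred[start:end] = 1
        return pred

    label = np.array([0, 1, 1, 1, 1, 0, 0, 1, 1, 0])
    pred = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])  # one lucky hit
    print(point_adjust(pred, label))  # [0 1 1 1 1 0 0 0 0 0]
    ```

    With long segments, even random predictions are likely to hit each segment at least once, which is exactly the overestimation the paper analyzes.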
    Graph Neural Networks: a bibliometrics overview. (arXiv:2201.01188v1 [cs.LG])
    (2 min) Recently, graph neural networks have become a hot topic in the machine learning community. This paper presents a Scopus-based bibliometric overview of GNN research since 2004, when GNN papers were first published. The study aims to evaluate GNN research trends, both quantitatively and qualitatively. We provide the trend of research, distribution of subjects, active and influential authors and institutions, sources of publications, most cited documents, and hot topics. Our investigation reveals that the most frequent subject categories in this field are computer science, engineering, telecommunications, linguistics, operations research and management science, information science and library science, business and economics, automation and control systems, robotics, and social sciences. In addition, the most active source of GNN publications is Lecture Notes in Computer Science. The most prolific or impactful institutions are found in the United States, China, and Canada. We also provide must-read papers and future directions. Finally, applications of graph convolutional networks and attention mechanisms are now among the hot topics of GNN research.
    The cluster structure function. (arXiv:2201.01222v1 [cs.LG])
    (2 min) For each given number of parts there is a partition of a data set such that every part is, as much as possible, a good model (an "algorithmic sufficient statistic") for the data in that part. Since this can be done for every number between one and the number of data points, the result is a function, the cluster structure function. It maps the number of parts of a partition to values related to the deficiencies of the parts as good models. Such a function starts at a value of at least zero for no partition of the data set and descends to zero for the partition of the data set into singleton parts. The optimal clustering is the one chosen to minimize the cluster structure function. The theory behind the method is expressed in algorithmic information theory (Kolmogorov complexity). In practice, the Kolmogorov complexities involved are approximated by a concrete compressor. We give examples using real data sets: the MNIST handwritten digits and the segmentation of real cells as used in stem cell research.
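    In practice, "approximated by a concrete compressor" means substituting compressed length for Kolmogorov complexity, as in the normalized-compression-distance line of work. A toy illustration of that substitution; the scoring below is a crude proxy, not the paper's deficiency definition:

    ```python
    import zlib

    def K(data: bytes) -> int:
        """Compressed length as a computable stand-in for Kolmogorov complexity."""
        return len(zlib.compress(data, 9))

    def partition_cost(parts):
        # crude proxy: total compressed size over the parts of a partition
        return sum(K(b"".join(p)) for p in parts)

    data = [b"aaaa", b"aaab", b"zzzz", b"zzzy"]
    print(partition_cost([data]))                  # the whole set as one part
    print(partition_cost([data[:2], data[2:]]))    # the same set split into two parts
    ```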
    AutoBalance: Optimized Loss Functions for Imbalanced Data. (arXiv:2201.01212v1 [cs.LG])
    (2 min) Imbalanced datasets are commonplace in modern machine learning problems. The presence of under-represented classes or groups with sensitive attributes results in concerns about generalization and fairness. Such concerns are further exacerbated by the fact that large capacity deep nets can perfectly fit the training data and appear to achieve perfect accuracy and fairness during training, but perform poorly at test time. To address these challenges, we propose AutoBalance, a bi-level optimization framework that automatically designs a training loss function to optimize a blend of accuracy and fairness-seeking objectives. Specifically, a lower-level problem trains the model weights, and an upper-level problem tunes the loss function by monitoring and optimizing the desired objective over the validation data. Our loss design enables personalized treatment for classes/groups by employing a parametric cross-entropy loss and individualized data augmentation schemes. We evaluate the benefits and performance of our approach for the application scenarios of imbalanced and group-sensitive classification. Extensive empirical evaluations demonstrate the benefits of AutoBalance over state-of-the-art approaches. Our experimental findings are complemented with theoretical insights on loss function design and the benefits of train-validation split. All code is available open-source.
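    The "parametric cross-entropy loss" can be pictured as per-class logit adjustments and weights that the upper-level problem tunes on validation data. A minimal sketch of such a loss; the parameterization is illustrative, not AutoBalance's exact design:

    ```python
    import numpy as np

    def parametric_cross_entropy(logits, y, delta, w):
        """Cross-entropy with per-class additive logit adjustments `delta` and
        multiplicative weights `w`, the kind of design parameters an
        upper-level (validation) problem could tune."""
        z = logits + delta                          # shift decision boundaries per class
        z = z - z.max(axis=1, keepdims=True)        # numerical stability
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -(w[y] * logp[np.arange(len(y)), y]).mean()
    ```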
    Sequoia: A Software Framework to Unify Continual Learning Research. (arXiv:2108.01005v3 [cs.LG] UPDATED)
    (2 min) The field of Continual Learning (CL) seeks to develop algorithms that accumulate knowledge and skills over time through interaction with non-stationary environments. In practice, a plethora of evaluation procedures (settings) and algorithmic solutions (methods) exist, each with their own potentially disjoint set of assumptions. This variety makes measuring progress in CL difficult. We propose a taxonomy of settings, where each setting is described as a set of assumptions. A tree-shaped hierarchy emerges from this view, where more general settings become the parents of those with more restrictive assumptions. This makes it possible to use inheritance to share and reuse research, as developing a method for a given setting also makes it directly applicable to any of its children. We instantiate this idea as a publicly available software framework called Sequoia, which features a wide variety of settings from both the Continual Supervised Learning (CSL) and Continual Reinforcement Learning (CRL) domains. Sequoia also includes a growing suite of methods which are easy to extend and customize, in addition to more specialized methods from external libraries. We hope that this new paradigm and its first implementation can help unify and accelerate research in CL. You can help us grow the tree by visiting github.com/lebrice/Sequoia.
    Neural Actor: Neural Free-view Synthesis of Human Actors with Pose Control. (arXiv:2106.02019v2 [cs.CV] UPDATED)
    (0 min) We propose Neural Actor (NA), a new method for high-quality synthesis of humans from arbitrary viewpoints and under arbitrary controllable poses. Our method is built upon recent neural scene representation and rendering works which learn representations of geometry and appearance from only 2D images. While existing works demonstrated compelling rendering of static scenes and playback of dynamic scenes, photo-realistic reconstruction and rendering of humans with neural implicit methods, in particular under user-controlled novel poses, is still difficult. To address this problem, we utilize a coarse body model as the proxy to unwarp the surrounding 3D space into a canonical pose. A neural radiance field learns pose-dependent geometric deformations and pose- and view-dependent appearance effects in the canonical space from multi-view video input. To synthesize novel views of high fidelity dynamic geometry and appearance, we leverage 2D texture maps defined on the body model as latent variables for predicting residual deformations and the dynamic appearance. Experiments demonstrate that our method achieves better quality than the state of the art on playback as well as novel pose synthesis, and can even generalize well to new poses that starkly differ from the training poses. Furthermore, our method also supports body shape control of the synthesized results.
    Monitoring and Anomaly Detection Actor-Critic Based Controlled Sensing. (arXiv:2201.00879v1 [cs.LG])
    (2 min) We address the problem of monitoring a set of binary stochastic processes and generating an alert when the number of anomalies among them exceeds a threshold. For this, the decision-maker selects and probes a subset of the processes to obtain noisy estimates of their states (normal or anomalous). Based on the received observations, the decision-maker first determines whether to declare that the number of anomalies has exceeded the threshold or to continue taking observations. When the decision is to continue, it then decides whether to collect observations at the next time instant or defer it to a later time. If it chooses to collect observations, it further determines the subset of processes to be probed. To devise this three-step sequential decision-making process, we use a Bayesian formulation wherein we learn the posterior probability on the states of the processes. Using the posterior probability, we construct a Markov decision process and solve it using deep actor-critic reinforcement learning. Via numerical experiments, we demonstrate the superior performance of our algorithm compared to the traditional model-based algorithms.
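    A sketch of the posterior update such a decision-maker could maintain, assuming independent binary processes observed through a symmetric noisy channel; the independence and noise-model assumptions here are illustrative, not necessarily the paper's model:

    ```python
    import numpy as np

    def update_posterior(prior, probed, obs, flip_prob):
        """One Bayes update of P(process k is anomalous) for each probed
        process, given a noisy binary observation of its state."""
        post = np.array(prior, dtype=float)
        for k, y in zip(probed, obs):
            like_anom = (1 - flip_prob) if y == 1 else flip_prob   # P(y | anomalous)
            like_norm = flip_prob if y == 1 else (1 - flip_prob)   # P(y | normal)
            p = post[k]
            post[k] = like_anom * p / (like_anom * p + like_norm * (1 - p))
        return post
    ```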
    Variance-Aware Off-Policy Evaluation with Linear Function Approximation. (arXiv:2106.11960v2 [cs.LG] UPDATED)
    (2 min) We study the off-policy evaluation (OPE) problem in reinforcement learning with linear function approximation, which aims to estimate the value function of a target policy based on the offline data collected by a behavior policy. We propose to incorporate the variance information of the value function to improve the sample efficiency of OPE. More specifically, for time-inhomogeneous episodic linear Markov decision processes (MDPs), we propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration. We show that our algorithm achieves a tighter error bound than the best-known result. We also provide a fine-grained characterization of the distribution shift between the behavior policy and the target policy. Extensive numerical experiments corroborate our theory.
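    The reweighting idea can be pictured as variance-weighted least squares inside each Fitted Q-Iteration step. A generic sketch, not the paper's exact estimator or its confidence adjustments:

    ```python
    import numpy as np

    def variance_weighted_regression(Phi, targets, var_est, reg=1e-3):
        """Ridge regression where each Bellman-target residual is weighted
        by the inverse of its estimated variance."""
        w = 1.0 / np.maximum(var_est, 1e-8)        # downweight high-variance targets
        A = Phi.T @ (w[:, None] * Phi) + reg * np.eye(Phi.shape[1])
        b = Phi.T @ (w * targets)
        return np.linalg.solve(A, b)
    ```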
    Neural Piecewise-Constant Delay Differential Equations. (arXiv:2201.00960v1 [cs.LG])
    (2 min) Continuous-depth neural networks, such as Neural Ordinary Differential Equations (ODEs), have aroused a great deal of interest in the machine learning and data science communities in recent years, as they bridge deep neural networks and dynamical systems. In this article, we introduce a new class of continuous-depth neural network, called Neural Piecewise-Constant Delay Differential Equations (PCDDEs). Here, unlike the recently proposed framework of Neural Delay Differential Equations (DDEs), we transform the single delay into piecewise-constant delay(s). On one hand, Neural PCDDEs with such a transformation inherit the universal approximation capability of Neural DDEs. On the other hand, by leveraging the contributions of information from multiple previous time steps, Neural PCDDEs further promote modeling capability without augmenting the network dimension. With such a promotion, we show that Neural PCDDEs outperform several existing continuous-depth neural frameworks on one-dimensional piecewise-constant-delay population dynamics and real-world datasets, including MNIST, CIFAR10, and SVHN.
    Evolutionary Multitasking AUC Optimization. (arXiv:2201.01145v1 [cs.LG])
    (2 min) Learning to optimize the area under the receiver operating characteristic curve (AUC) for imbalanced data has attracted much attention in recent years. Although several AUC optimization methods exist, scaling up AUC optimization remains an open issue because of its pairwise learning style. Maximizing AUC on a large-scale dataset is a non-convex and expensive problem. Inspired by this pairwise characteristic, a cheap AUC optimization task on a small-scale dataset sampled from the large-scale dataset is constructed to promote the AUC accuracy of the original, large-scale, and expensive AUC optimization task. This paper develops an evolutionary multitasking framework (termed EMTAUC) to make full use of the information shared between the constructed cheap and expensive tasks to obtain higher performance. In EMTAUC, one task optimizes AUC on the sampled dataset, and the other maximizes AUC on the original dataset. Moreover, because the cheap task contains limited knowledge, a strategy for dynamically adjusting the data structure of the inexpensive task is proposed to introduce more knowledge into the multitasking AUC optimization environment. The performance of the proposed method is evaluated on a series of binary classification datasets. The experimental results demonstrate that EMTAUC is highly competitive with single-task and online methods. Supplementary materials and the source code implementation of EMTAUC can be accessed at https://github.com/xiaofangxd/EMTAUC.
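    The pairwise learning style the abstract refers to is exactly what makes AUC optimization expensive: the loss couples every positive with every negative, so the cost scales with n_pos x n_neg. A minimal pairwise surrogate illustrating this:

    ```python
    import numpy as np

    def pairwise_auc_hinge(scores_pos, scores_neg, margin=1.0):
        """Hinge surrogate for 1 - AUC over all positive-negative pairs; the
        n_pos x n_neg pair matrix is the scalability bottleneck."""
        scores_pos = np.asarray(scores_pos, dtype=float)
        scores_neg = np.asarray(scores_neg, dtype=float)
        diff = scores_pos[:, None] - scores_neg[None, :]
        return float(np.maximum(0.0, margin - diff).mean())
    ```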
    Supervised Homogeneity Fusion: a Combinatorial Approach. (arXiv:2201.01036v1 [stat.ML])
    (2 min) Fusing regression coefficients into homogeneous groups can unveil those coefficients that share a common value within each group. Such groupwise homogeneity reduces the intrinsic dimension of the parameter space and unleashes sharper statistical accuracy. We propose and investigate a new combinatorial grouping approach called $L_0$-Fusion that is amenable to mixed integer optimization (MIO). On the statistical aspect, we identify a fundamental quantity called grouping sensitivity that underpins the difficulty of recovering the true groups. We show that $L_0$-Fusion achieves grouping consistency under the weakest possible requirement of the grouping sensitivity: if this requirement is violated, then the minimax risk of group misspecification will fail to converge to zero. Moreover, we show that in the high-dimensional regime, one can apply $L_0$-Fusion coupled with a sure screening set of features without any essential loss of statistical efficiency, while reducing the computational cost substantially. On the algorithmic aspect, we provide a MIO formulation for $L_0$-Fusion along with a warm start strategy. Simulation and real data analysis demonstrate that $L_0$-Fusion exhibits superiority over its competitors in terms of grouping accuracy.
    Neural Myerson Auction for Truthful and Energy-Efficient Autonomous Aerial Data Delivery. (arXiv:2201.01170v1 [cs.GT])
    (2 min) A successful deployment of drones provides an ideal solution for surveillance systems. Drones used for surveillance can reach areas that are difficult or impossible for humans or land vehicles, gathering images or video recordings of a specific target within their coverage. We therefore introduce a data delivery drone to transfer collected surveillance data under harsh communication conditions. This paper proposes a Myerson auction-based asynchronous data delivery scheme for an aerial distributed data platform in surveillance systems, taking battery limitations and long-flight constraints into account. In this paper, multiple delivery drones compete to offer data transfer to a single fixed-location surveillance drone. Our proposed Myerson auction-based algorithm, which uses the truthful second-price auction (SPA) as a baseline, maximizes the seller's revenue while satisfying several desirable properties, i.e., individual rationality and incentive compatibility, while preserving truthful operation. On top of these SPA-based operations, a deep learning-based framework is additionally designed to improve delivery performance.
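    The truthful second-price auction (SPA) used as the baseline is simple to state; a minimal sketch (the Myerson mechanism additionally applies virtual valuations and a reserve price):

    ```python
    def second_price_auction(bids):
        """Vickrey/second-price auction: the highest bidder wins but pays the
        second-highest bid, which makes truthful bidding a dominant strategy."""
        order = sorted(range(len(bids)), key=lambda i: -bids[i])
        winner, price = order[0], bids[order[1]]
        return winner, price

    second_price_auction([3.0, 7.5, 5.2])   # -> (1, 5.2)
    ```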
    An unfeasibility view of neural network learning. (arXiv:2201.00945v1 [cs.LG])
    (2 min) We define the notion of a continuously differentiable perfect learning algorithm for multilayer neural network architectures and show that such algorithms do not exist, provided that the length of the data set exceeds the number of involved parameters and the activation functions are logistic, tanh, or sin.
    Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds, and Benign Overfitting. (arXiv:2106.09276v2 [stat.ML] UPDATED)
    (2 min) We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an arbitrary hypothesis class in terms of the class's Gaussian width. Applying the generic bound to Euclidean norm balls recovers the consistency result of Bartlett et al. (2020) for minimum-norm interpolators, and confirms a prediction of Zhou et al. (2020) for near-minimal-norm interpolators in the special case of Gaussian data. We demonstrate the generality of the bound by applying it to the simplex, obtaining a novel consistency result for minimum l1-norm interpolators (basis pursuit). Our results show how norm-based generalization bounds can explain and be used to analyze benign overfitting, at least in some settings.
    Deep neural networks for smooth approximation of physics with higher order and continuity B-spline base functions. (arXiv:2201.00904v1 [math.NA])
    (2 min) This paper deals with the following important research question. Traditionally, a neural network employs non-linear activation functions concatenated with linear operators to approximate a given physical phenomenon. It "fills the space" with concatenations of activation functions and linear operators and adjusts their coefficients to approximate the physical phenomenon. We claim that it is better to "fill the space" with linear combinations of smooth higher-order B-spline basis functions, as employed by isogeometric analysis, and to utilize the neural network to adjust the coefficients of these linear combinations. In other words, we evaluate the possibilities of using neural networks to approximate the coefficients of the B-spline basis functions and of approximating the solution directly. Solving differential equations with neural networks was proposed by Maziar Raissi et al. in 2017 with the introduction of Physics-Informed Neural Networks (PINN), which naturally encode underlying physical laws as prior information. Approximating coefficients using a function as an input leverages the well-known capability of neural networks as universal function approximators. In essence, in the PINN approach the network approximates the value of the given field at a given point. We present an alternative approach, where the physical quantity is approximated as a linear combination of smooth B-spline basis functions, and the neural network approximates the coefficients of the B-splines. This research compares results from a DNN approximating the coefficients of the linear combination of B-spline basis functions with a DNN approximating the solution directly. We show that our approach is cheaper and more accurate when approximating smooth physical fields.
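    The proposed split of roles is easy to picture: a neural network outputs coefficients, and a fixed smooth B-spline basis carries the spatial field. A minimal sketch of the basis side (the knot layout is illustrative):

    ```python
    import numpy as np
    from scipy.interpolate import BSpline

    degree, n_coef = 2, 8                     # smooth quadratic basis on [0, 1]
    knots = np.concatenate([np.zeros(degree),
                            np.linspace(0.0, 1.0, n_coef - degree + 1),
                            np.ones(degree)])

    def field_from_coefficients(coef):
        """u(x) = sum_k c_k B_k(x); in the paper's setup a DNN would predict
        `coef`, so smoothness comes from the basis, not the network."""
        return BSpline(knots, np.asarray(coef, dtype=float), degree)

    u = field_from_coefficients(np.random.randn(n_coef))
    u(np.linspace(0.0, 1.0, 5))               # evaluate the smooth field anywhere
    ```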
    Convolutional Normalization: Improving Deep Convolutional Network Robustness and Training. (arXiv:2103.00673v2 [cs.CV] UPDATED)
    (2 min) Normalization techniques have become a basic component in modern convolutional neural networks (ConvNets). In particular, many recent works demonstrate that promoting the orthogonality of the weights helps train deep models and improve robustness. For ConvNets, most existing methods are based on penalizing or normalizing weight matrices derived from concatenating or flattening the convolutional kernels. These methods often destroy or ignore the benign convolutional structure of the kernels; therefore, they are often expensive or impractical for deep ConvNets. In contrast, we introduce a simple and efficient "Convolutional Normalization" (ConvNorm) method that can fully exploit the convolutional structure in the Fourier domain and serve as a simple plug-and-play module to be conveniently incorporated into any ConvNets. Our method is inspired by recent work on preconditioning methods for convolutional sparse coding and can effectively promote each layer's channel-wise isometry. Furthermore, we show that our ConvNorm can reduce the layerwise spectral norm of the weight matrices and hence improve the Lipschitzness of the network, leading to easier training and improved robustness for deep ConvNets. Applied to classification under noise corruptions and generative adversarial network (GAN), we show that the ConvNorm improves the robustness of common ConvNets such as ResNet and the performance of GAN. We verify our findings via numerical experiments on CIFAR and ImageNet.
    APReL: A Library for Active Preference-based Reward Learning Algorithms. (arXiv:2108.07259v2 [cs.LG] UPDATED)
    (2 min) Reward learning is a fundamental problem in human-robot interaction: to have robots that operate in alignment with what their human user wants. Many preference-based learning algorithms and active querying techniques have been proposed as a solution to this problem. In this paper, we present APReL, a library for active preference-based reward learning algorithms, which enables researchers and practitioners to experiment with existing techniques and easily develop their own algorithms for various modules of the problem. APReL is available at https://github.com/Stanford-ILIAD/APReL.
    Deriving discriminative classifiers from generative models. (arXiv:2201.00844v1 [stat.ML])
    (2 min) We deal with Bayesian generative and discriminative classifiers. Given a model distribution $p(x, y)$, with observation $y$ and target $x$, one computes generative classifiers by first considering $p(x, y)$ and then using the Bayes rule to calculate $p(x | y)$. A discriminative model is directly given by $p(x | y)$, which is used to compute discriminative classifiers. However, recent works showed that the Bayesian Maximum Posterior classifier defined from Naive Bayes (NB) or Hidden Markov Chains (HMC), both generative models, can also match the discriminative classifier definition. Thus, there are situations in which dividing classifiers into "generative" and "discriminative" is somewhat misleading. Indeed, such a distinction is rather related to the way of computing classifiers, not to the classifiers themselves. We present a general theoretical result specifying how a generative classifier induced from a generative model can also be computed in a discriminative way from the same model. Examples of NB and HMC are recovered as particular cases, and we apply the general result to two original extensions of NB, and to two extensions of HMC, one of which is original. Finally, we briefly illustrate the interest of the new discriminative way of computing classifiers in the Natural Language Processing (NLP) framework.
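    The point is concrete even for a toy Naive Bayes over binary features: the same posterior can be computed generatively, by normalizing the joint, or discriminatively, as a logistic-regression-style function of the features. A toy sketch with illustrative numbers:

    ```python
    import numpy as np

    prior = np.array([0.6, 0.4])             # p(x) for classes 0 and 1
    cond = np.array([[0.9, 0.2],             # cond[i, x] = p(y_i = 1 | x)
                     [0.3, 0.8]])

    def posterior_generative(y):
        """Generative route: build p(x, y), then apply the Bayes rule."""
        joint = prior * np.prod(np.where(y[:, None] == 1, cond, 1 - cond), axis=0)
        return joint / joint.sum()

    def posterior_discriminative(y):
        """Discriminative route: the identical posterior written directly as a
        logistic function of y, never materializing the joint."""
        w = np.log(cond[:, 1] / cond[:, 0]) - np.log((1 - cond[:, 1]) / (1 - cond[:, 0]))
        b = np.log(prior[1] / prior[0]) + np.log((1 - cond[:, 1]) / (1 - cond[:, 0])).sum()
        p1 = 1.0 / (1.0 + np.exp(-(w @ y + b)))
        return np.array([1.0 - p1, p1])

    y = np.array([1, 0])
    assert np.allclose(posterior_generative(y), posterior_discriminative(y))
    ```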
    Low dosage 3D volume fluorescence microscopy imaging using compressive sensing. (arXiv:2201.00820v1 [eess.IV])
    (2 min) Fluorescence microscopy has been a significant tool for long-term in vivo imaging of embryo growth over time. However, cumulative exposure is phototoxic to such sensitive live samples. While techniques like light-sheet fluorescence microscopy (LSFM) allow for reduced exposure, they are not well suited for deep imaging. Various low-dosage imaging techniques have been developed to reconstruct 3D volumes from a few slices along the axial direction (z-axis); however, they often lack restoration quality, and acquiring densely spaced images along the axial direction remains computationally expensive. To address this challenge, we present a compressive sensing (CS) based approach that fully reconstructs 3D volumes with the same signal-to-noise ratio (SNR) using less than half the excitation dosage. We present the theory and experimentally validate the approach. To demonstrate our technique, we capture a 3D volume of RFP-labeled neurons in the zebrafish embryo spinal cord (30 µm thickness) with an axial sampling of 0.1 µm using a confocal microscope. We observe that the CS-based approach achieves accurate 3D volume reconstruction from less than 20% of the entire stack of optical sections. The developed CS-based methodology can be readily applied to other deep imaging modalities, such as two-photon and light-sheet microscopy, where reducing sample phototoxicity is a critical challenge.
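    CS reconstruction of this kind typically reduces to an L1-regularized least-squares problem. A generic iterative soft-thresholding (ISTA) sketch of such a solver, not the authors' specific pipeline:

    ```python
    import numpy as np

    def ista(A, y, lam=0.1, n_iter=200):
        """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 by iterative
        soft-thresholding, a standard compressive-sensing solver."""
        L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
        x = np.zeros(A.shape[1])
        for _ in range(n_iter):
            g = x - A.T @ (A @ x - y) / L        # gradient step on the quadratic term
            x = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)   # shrinkage
        return x
    ```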
    Generating synthetic mobility data for a realistic population with RNNs to improve utility and privacy. (arXiv:2201.01139v1 [cs.LG])
    (2 min) Location data collected from mobile devices represent mobility behaviors at individual and societal levels. These data have important applications ranging from transportation planning to epidemic modeling. However, two issues must be overcome to best serve these use cases: the data often represent a limited sample of the population, and their use jeopardizes privacy. To address these issues, we present and evaluate a system for generating synthetic mobility data using a deep recurrent neural network (RNN) trained on real location data. The system takes a population distribution as input and generates mobility traces for a corresponding synthetic population. Related generative approaches have not solved the challenges of capturing both the patterns and the variability in individuals' mobility behaviors over longer time periods, while also balancing the generation of realistic data with privacy. Our system leverages the ability of RNNs to generate complex and novel sequences while retaining patterns from the training data. The model also introduces randomness to calibrate the variation between synthetic and real data at the individual level, both to capture the variability of human mobility and to protect user privacy. Location-based services (LBS) data from more than 22,700 mobile devices were used in an experimental evaluation across utility and privacy metrics. We show that the generated mobility data retain the characteristics of the real data while varying from it at the individual level, and that this amount of variation matches the variation within the real data.
  • stat.ML updates on arXiv.org

    Exploring Architectural Ingredients of Adversarially Robust Deep Neural Networks. (arXiv:2110.03825v4 [cs.LG] UPDATED)
    (2 min) Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks. A range of defense methods have been proposed to train adversarially robust DNNs, among which adversarial training has demonstrated promising results. However, despite preliminary understandings developed for adversarial training, it is still not clear, from the architectural perspective, what configurations can lead to more robust DNNs. In this paper, we address this gap via a comprehensive investigation on the impact of network width and depth on the robustness of adversarially trained DNNs. Specifically, we make the following key observations: 1) more parameters (higher model capacity) do not necessarily help adversarial robustness; 2) reducing capacity at the last stage (the last group of blocks) of the network can actually improve adversarial robustness; and 3) under the same parameter budget, there exists an optimal architectural configuration for adversarial robustness. We also provide a theoretical analysis explaining why such network configurations can help robustness. These architectural insights can help design adversarially robust DNNs. Code is available at \url{https://github.com/HanxunH/RobustWRN}.
    Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds, and Benign Overfitting. (arXiv:2106.09276v2 [stat.ML] UPDATED)
    (2 min) We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an arbitrary hypothesis class in terms of the class's Gaussian width. Applying the generic bound to Euclidean norm balls recovers the consistency result of Bartlett et al. (2020) for minimum-norm interpolators, and confirms a prediction of Zhou et al. (2020) for near-minimal-norm interpolators in the special case of Gaussian data. We demonstrate the generality of the bound by applying it to the simplex, obtaining a novel consistency result for minimum l1-norm interpolators (basis pursuit). Our results show how norm-based generalization bounds can explain and be used to analyze benign overfitting, at least in some settings.
    Fast Approximation of the Sliced-Wasserstein Distance Using Concentration of Random Projections. (arXiv:2106.15427v2 [stat.ML] UPDATED)
    (2 min) The Sliced-Wasserstein distance (SW) is being increasingly used in machine learning applications as an alternative to the Wasserstein distance and offers significant computational and statistical benefits. Since it is defined as an expectation over random projections, SW is commonly approximated by Monte Carlo. We adopt a new perspective to approximate SW by making use of the concentration-of-measure phenomenon: under mild assumptions, one-dimensional projections of a high-dimensional random vector are approximately Gaussian. Based on this observation, we develop a simple deterministic approximation for SW. Our method does not require sampling a number of random projections, and is therefore both accurate and easy to use compared to the usual Monte Carlo approximation. We derive non-asymptotic guarantees for our approach, and show that the approximation error goes to zero as the dimension increases, under a weak dependence condition on the data distribution. We validate our theoretical findings on synthetic datasets, and illustrate the proposed approximation on a generative modeling problem.
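    The concentration idea admits a compact back-of-the-envelope version: if one-dimensional projections are approximately Gaussian, the 1-D Wasserstein distances have a closed form. The sketch below is a simplification of that idea, not the paper's exact estimator:

    ```python
    import numpy as np

    def sw2_gaussian_sketch(X, Y):
        """Deterministic stand-in for SW2: treat 1-D projections of each sample
        as Gaussian and use the closed-form W2 between the fitted Gaussians."""
        d = X.shape[1]
        mX, mY = X.mean(axis=0), Y.mean(axis=0)
        sX = np.sqrt(np.trace(np.cov(X.T)) / d)   # average projected std of X
        sY = np.sqrt(np.trace(np.cov(Y.T)) / d)
        mean_gap2 = ((mX - mY) ** 2).sum() / d    # E_theta (theta^T (mX - mY))^2
        return np.sqrt(mean_gap2 + (sX - sY) ** 2)
    ```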
    Discovering Diverse Nearly Optimal Policies with Successor Features. (arXiv:2106.00669v2 [cs.AI] UPDATED)
    (2 min) Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose Diverse Successive Policies, a method for discovering policies that are diverse in the space of Successor Features, while assuring that they are near optimal. We formalize the problem as a Constrained Markov Decision Process (CMDP) where the goal is to find policies that maximize diversity, characterized by an intrinsic diversity reward, while remaining near-optimal with respect to the extrinsic reward of the MDP. We also analyze how recently proposed robustness and discrimination rewards perform and find that they are sensitive to the initialization of the procedure and may converge to sub-optimal solutions. To alleviate this, we propose new explicit diversity rewards that aim to minimize the correlation between the Successor Features of the policies in the set. We compare the different diversity mechanisms in the DeepMind Control Suite and find that the type of explicit diversity we propose is important for discovering distinct behaviors, such as different locomotion patterns.
    Variance-Aware Off-Policy Evaluation with Linear Function Approximation. (arXiv:2106.11960v2 [cs.LG] UPDATED)
    (2 min) We study the off-policy evaluation (OPE) problem in reinforcement learning with linear function approximation, which aims to estimate the value function of a target policy based on the offline data collected by a behavior policy. We propose to incorporate the variance information of the value function to improve the sample efficiency of OPE. More specifically, for time-inhomogeneous episodic linear Markov decision processes (MDPs), we propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration. We show that our algorithm achieves a tighter error bound than the best-known result. We also provide a fine-grained characterization of the distribution shift between the behavior policy and the target policy. Extensive numerical experiments corroborate our theory.
    Supervised Homogeneity Fusion: a Combinatorial Approach. (arXiv:2201.01036v1 [stat.ML])
    (2 min) Fusing regression coefficients into homogeneous groups can unveil those coefficients that share a common value within each group. Such groupwise homogeneity reduces the intrinsic dimension of the parameter space and unleashes sharper statistical accuracy. We propose and investigate a new combinatorial grouping approach called $L_0$-Fusion that is amenable to mixed integer optimization (MIO). On the statistical aspect, we identify a fundamental quantity called grouping sensitivity that underpins the difficulty of recovering the true groups. We show that $L_0$-Fusion achieves grouping consistency under the weakest possible requirement of the grouping sensitivity: if this requirement is violated, then the minimax risk of group misspecification will fail to converge to zero. Moreover, we show that in the high-dimensional regime, one can apply $L_0$-Fusion coupled with a sure screening set of features without any essential loss of statistical efficiency, while reducing the computational cost substantially. On the algorithmic aspect, we provide a MIO formulation for $L_0$-Fusion along with a warm start strategy. Simulation and real data analysis demonstrate that $L_0$-Fusion exhibits superiority over its competitors in terms of grouping accuracy.
    Self-supervised representation learning from 12-lead ECG data. (arXiv:2103.12676v2 [eess.SP] UPDATED)
    (2 min) Clinical 12-lead electrocardiography (ECG) is one of the most widely encountered kinds of biosignals. Despite the increased availability of public ECG datasets, label scarcity remains a central challenge in the field. Self-supervised learning represents a promising way to alleviate this issue. In this work, we put forward the first comprehensive assessment of self-supervised representation learning from clinical 12-lead ECG data. To this end, we adapt state-of-the-art self-supervised methods based on instance discrimination and latent forecasting to the ECG domain. In a first step, we learn contrastive representations and evaluate their quality based on linear evaluation performance on a recently established, comprehensive, clinical ECG classification task. In a second step, we analyze the impact of self-supervised pretraining on finetuned ECG classifiers as compared to purely supervised performance. For the best-performing method, an adaptation of contrastive predictive coding, we find a linear evaluation performance only 0.5% below supervised performance. For the finetuned models, we find improvements in downstream performance of roughly 1% compared to purely supervised training, as well as gains in label efficiency and robustness against physiological noise. This work clearly establishes the feasibility of extracting discriminative representations from ECG data via self-supervised learning, and the numerous advantages of finetuning such representations on downstream tasks as compared to purely supervised training. As the first comprehensive assessment of its kind in the ECG domain, carried out exclusively on publicly available datasets, we hope to establish a first step towards reproducible progress in the rapidly evolving field of representation learning for biosignals.
    Kendall transformation: a robust representation of continuous data for information theory. (arXiv:2006.15991v2 [cs.LG] UPDATED)
    (2 min) The Kendall transformation is a conversion of an ordered feature into a vector of pairwise order relations between individual values. In this way, it preserves the ranking of observations and represents it in a categorical form. Such a transformation allows for generalisation of methods requiring strictly categorical input, especially in the limit of a small number of observations, when discretisation becomes problematic. In particular, many approaches from information theory can be directly applied to Kendall-transformed continuous data without relying on differential entropy or any additional parameters. Moreover, by restricting information to that contained in the ranking, the Kendall transformation leads to better robustness at the reasonable cost of dropping sophisticated interactions which are anyhow unlikely to be correctly estimated. In bivariate analysis, the Kendall transformation can be related to popular non-parametric methods, showing the soundness of the approach. The paper also demonstrates its efficiency in multivariate problems, and provides an example analysis of real-world data.
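    The transformation itself is short to state; a sketch over the unordered pairs of a feature (the paper may enumerate pairs differently, but the encoded ranking information is the same):

    ```python
    import numpy as np

    def kendall_transform(x):
        """Encode a feature as the vector of pairwise order relations
        between its values, as -1 / 0 / +1 categories."""
        x = np.asarray(x)
        i, j = np.triu_indices(len(x), k=1)
        return np.sign(x[j] - x[i]).astype(int)

    kendall_transform([3.2, 1.0, 1.0, 7.5])   # -> array([-1, -1,  1,  0,  1,  1])
    ```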
    Deriving discriminative classifiers from generative models. (arXiv:2201.00844v1 [stat.ML])
    (2 min) We deal with Bayesian generative and discriminative classifiers. Given a model distribution $p(x, y)$, with the observation $y$ and the target $x$, one computes generative classifiers by firstly considering $p(x, y)$ and then using the Bayes rule to calculate $p(x | y)$. A discriminative model is directly given by $p(x | y)$, which is used to compute discriminative classifiers. However, recent works showed that the Bayesian Maximum Posterior classifier defined from the Naive Bayes (NB) or Hidden Markov Chain (HMC), both generative models, can also match the discriminative classifier definition. Thus, there are situations in which dividing classifiers into "generative" and "discriminative" is somewhat misleading. Indeed, such a distinction is rather related to the way of computing classifiers, not to the classifiers themselves. We present a general theoretical result specifying how a generative classifier induced from a generative model can also be computed in a discriminative way from the same model. Examples of NB and HMC are found again as particular cases, and we apply the general result to two original extensions of NB, and two extensions of HMC, one of which being original. Finally, we shortly illustrate the interest of the new discriminative way of computing classifiers in the Natural Language Processing (NLP) framework.
    Open Access Dataset for Electromyography based Multi-code Biometric Authentication. (arXiv:2201.01051v1 [cs.CR])
    (2 min) Recently, surface electromyogram (EMG) has been proposed as a novel biometric trait for addressing some key limitations of current biometrics, such as spoofing and liveness. EMG signals possess a unique characteristic: they are inherently different across individuals (biometrics), and they can be customized to realize multi-length codes or passwords (for example, by performing different gestures). However, current EMG-based biometric research has two critical limitations: 1) a small subject pool, compared to other, more established biometric traits, and 2) being limited to single-session or single-day data sets. In this study, forearm and wrist EMG data were collected from 43 participants over three different days with long separations between sessions while they performed static hand and wrist gestures. The multi-day biometric authentication resulted in a median EER of 0.017 for the forearm setup and 0.025 for the wrist setup, comparable to well-established biometric traits, suggesting consistent performance over multiple days. The presented large-sample multi-day data set and findings could facilitate further research on EMG-based biometrics and other gesture recognition-based applications.
    Neural Estimation of Statistical Divergences. (arXiv:2110.03652v2 [math.ST] UPDATED)
    (2 min) Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a basic constituent of statistical inference and machine learning. A modern method for estimating those divergences relies on parametrizing an empirical variational form by a neural network (NN) and optimizing over parameter space. Such neural estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. In particular, there is a fundamental tradeoff between the two sources of error involved: approximation and empirical estimation. While the former needs the NN class to be rich and expressive, the latter relies on controlling complexity. We explore this tradeoff for an estimator based on a shallow NN by means of non-asymptotic error bounds, focusing on four popular $\mathsf{f}$-divergences -- Kullback-Leibler, chi-squared, squared Hellinger, and total variation. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory. The bounds reveal the tension between NN size and the number of samples, and enable characterizing scaling rates that ensure consistency. For compactly supported distributions, we further show that neural estimators of the first three divergences above with appropriate NN growth rates are near minimax rate-optimal, achieving the parametric rate up to logarithmic factors.
    Optimal design of the Barker proposal and other locally-balanced Metropolis-Hastings algorithms. (arXiv:2201.01123v1 [stat.CO])
    (2 min) We study the class of first-order locally-balanced Metropolis-Hastings algorithms introduced in Livingstone & Zanella (2021). To choose a specific algorithm within the class the user must select a balancing function $g:\mathbb{R} \to \mathbb{R}$ satisfying $g(t) = tg(1/t)$, and a noise distribution for the proposal increment. Popular choices within the class are the Metropolis-adjusted Langevin algorithm and the recently introduced Barker proposal. We first establish a universal limiting optimal acceptance rate of 57% and scaling of $n^{-1/3}$ as the dimension $n$ tends to infinity among all members of the class under mild smoothness assumptions on $g$ and when the target distribution for the algorithm is of the product form. In particular we obtain an explicit expression for the asymptotic efficiency of an arbitrary algorithm in the class, as measured by expected squared jumping distance. We then consider how to optimise this expression under various constraints. We derive an optimal choice of noise distribution for the Barker proposal, optimal choice of balancing function under a Gaussian noise distribution, and optimal choice of first-order locally-balanced algorithm among the entire class, which turns out to depend on the specific target distribution. Numerical simulations confirm our theoretical findings and in particular show that a bi-modal choice of noise distribution in the Barker proposal gives rise to a practical algorithm that is consistently more efficient than the original Gaussian version.
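    For orientation, a 1-D sketch of a Barker-proposal step with Gaussian increment noise, following the first-order locally-balanced construction; the step size and the Gaussian noise choice are illustrative, and the paper's optimized bi-modal noise would replace the Gaussian draw:

    ```python
    import numpy as np

    def barker_step(x, log_pi, grad_log_pi, step, rng):
        """One Barker-proposal MCMC step in 1-D: a locally-balanced choice of
        move direction, followed by a Metropolis-Hastings correction."""
        z = step * rng.standard_normal()
        p_forward = 1.0 / (1.0 + np.exp(-z * grad_log_pi(x)))  # P(direction = +1)
        b = 1.0 if rng.random() < p_forward else -1.0
        y = x + b * z
        # log acceptance ratio for the direction-skewed proposal density
        num = log_pi(y) - np.log1p(np.exp(-(x - y) * grad_log_pi(y)))
        den = log_pi(x) - np.log1p(np.exp(-(y - x) * grad_log_pi(x)))
        return y if np.log(rng.random()) < num - den else x
    ```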
    Learning to Play No-Press Diplomacy with Best Response Policy Iteration. (arXiv:2006.04635v4 [cs.LG] UPDATED)
    (2 min) Recent advances in deep reinforcement learning (RL) have led to considerable progress in many 2-player zero-sum games, such as Go, Poker and Starcraft. The purely adversarial nature of such games allows for conceptually simple and principled application of RL methods. However real-world settings are many-agent, and agent interactions are complex mixtures of common-interest and competitive aspects. We consider Diplomacy, a 7-player board game designed to accentuate dilemmas resulting from many-agent interactions. It also features a large combinatorial action space and simultaneous moves, which are challenging for RL algorithms. We propose a simple yet effective approximate best response operator, designed to handle large combinatorial action spaces and simultaneous moves. We also introduce a family of policy iteration methods that approximate fictitious play. With these methods, we successfully apply RL to Diplomacy: we show that our agents convincingly outperform the previous state-of-the-art, and game theoretic equilibrium analysis shows that the new process yields consistent improvements.
    Descriptive vs. inferential community detection: pitfalls, myths and half-truths. (arXiv:2112.00183v3 [physics.soc-ph] UPDATED)
    (2 min) Community detection is one of the most important methodological fields of network science, and one which has attracted a significant amount of attention over the past decades. This area deals with the automated division of a network into fundamental building blocks, with the objective of providing a summary of its large-scale structure. Despite its importance and widespread adoption, there is a noticeable gap between what is considered the state-of-the-art and the methods that are actually used in practice in a variety of fields. Here we attempt to address this discrepancy by dividing existing methods according to whether they have a "descriptive" or an "inferential" goal. While descriptive methods find patterns in networks based on intuitive notions of community structure, inferential methods articulate a precise generative model, and attempt to fit it to data. In this way, they are able to provide insights into the mechanisms of network formation, and separate structure from randomness in a manner supported by statistical evidence. We review how employing descriptive methods with inferential aims is riddled with pitfalls and misleading answers, and thus should be in general avoided. We argue that inferential methods are more typically aligned with clearer scientific questions, yield more robust results, and should be in many cases preferred. We attempt to dispel some myths and half-truths often believed when community detection is employed in practice, in an effort to improve both the use of such methods as well as the interpretation of their results.
    On the effectiveness of adversarial training against common corruptions. (arXiv:2103.02325v2 [cs.LG] UPDATED)
    (2 min) The literature on robustness towards common corruptions shows no consensus on whether adversarial training can improve the performance in this setting. First, we show that, when used with an appropriately selected perturbation radius, $\ell_p$ adversarial training can serve as a strong baseline against common corruptions, improving both accuracy and calibration. Then we explain why adversarial training performs better than data augmentation with simple Gaussian noise, which has been observed to be a meaningful baseline on common corruptions. Related to this, we identify the $\sigma$-overfitting phenomenon, when Gaussian augmentation overfits to the particular standard deviation used for training, which has a significant detrimental effect on common corruption accuracy. We discuss how to alleviate this problem and then how to further enhance $\ell_p$ adversarial training by introducing an efficient relaxation of adversarial training with learned perceptual image patch similarity as the distance metric. Through experiments on CIFAR-10 and ImageNet-100, we show that our approach not only improves the $\ell_p$ adversarial training baseline but also has cumulative gains with data augmentation methods such as AugMix, DeepAugment, ANT, and SIN, leading to state-of-the-art performance on common corruptions. The code of our experiments is publicly available at https://github.com/tml-epfl/adv-training-corruptions.
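    The sigma-overfitting observation suggests, at minimum, not committing to a single noise level. One natural mitigation, offered here as an illustration rather than necessarily the paper's remedy, is to resample sigma per batch:

    ```python
    import numpy as np

    def gaussian_augment(x, rng, sigma_max=0.1):
        """Additive Gaussian noise with a freshly sampled standard deviation,
        so the model cannot overfit to one particular noise level."""
        sigma = rng.uniform(0.0, sigma_max)
        return x + sigma * rng.standard_normal(x.shape)
    ```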
    Biased Hypothesis Formation From Projection Pursuit. (arXiv:2201.00889v1 [cs.LG])
    (2 min) The effect of bias on hypothesis formation is characterized for an automated data-driven projection pursuit neural network to extract and select features for binary classification of data streams. This intelligent exploratory process partitions a complete vector state space into disjoint subspaces to create working hypotheses quantified by similarities and differences observed between two groups of labeled data streams. Data streams are typically time sequenced, and may exhibit complex spatio-temporal patterns. For example, given atomic trajectories from molecular dynamics simulation, the machine's task is to quantify dynamical mechanisms that promote function by comparing protein mutants, some known to function while others are nonfunctional. Utilizing synthetic two-dimensional molecules that mimic the dynamics of functional and nonfunctional proteins, biases are identified and controlled in both the machine learning model and selected training data under different contexts. The refinement of a working hypothesis converges to a statistically robust multivariate perception of the data based on a context-dependent perspective. Including diverse perspectives during data exploration enhances interpretability of the multivariate characterization of similarities and differences.
  • The Official NVIDIA Blog

    Prepare for Genshin Impact, Coming to GeForce NOW in Limited Beta
    (4 min) GeForce NOW is charging into the new year at full force. This GFN Thursday comes with the news that Genshin Impact, the popular open-world action role-playing game, will be coming to the cloud this year, arriving in a limited beta. Plus, this year’s CES announcements were packed with news for GeForce NOW. Battlefield 4: Premium […]
  • Blog – Machine Learning Mastery

    Python debugging tools
    (16 min) In all programming exercises, it is difficult to go far and deep without a handy debugger. The built-in debugger, pdb, […]
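    For readers who have not used it, the built-in debugger can be dropped into any script with a single call:

    ```python
    def running_total(xs):
        total = 0
        for x in xs:
            breakpoint()      # pauses here in pdb (Python 3.7+); try `p x`, `n`, `c`
            total += x
        return total

    running_total([1, 2, 3])
    ```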
  • Artificial Intelligence

    Meta AI and CMU Researchers Present ‘BANMo’: A New Neural Network-Based Method To Build Animatable 3D Models From Videos
    (1 min) Previous work on articulated 3D shape reconstruction has frequently relied on specialized sensors (e.g., synchronized multi-camera systems) or pre-built 3D deformable models (e.g., SMAL or SMPL). Such approaches cannot scale to a wide range of items in the wild. BANMo is a technique that needs neither a specialized sensor nor a pre-defined template shape. In a differentiable rendering framework, BANMo generates high-fidelity, articulated 3D models (including shape and animatable skinning weights) from a large number of monocular casual videos. While using several videos increases coverage of camera perspectives and object articulations, it introduces significant challenges in establishing correspondence across scenes with diverse backdrops, lighting conditions, etc. Paper: https://arxiv.org/pdf/2112.12761.pdf Project: https://banmo-www.github.io/ submitted by /u/ai-lover
    [R] University of Amsterdam & Meta AI Propose a Roadmap Toward Interactive Language Modelling Based on Caregiver-Child Interactions
    (1 min) In the new paper Towards Interactive Language Modeling, a research team from the University of Amsterdam and Meta AI Labs presents a road map detailing the steps to be taken towards interactive language modelling. Here is a quick read: University of Amsterdam & Meta AI Propose a Roadmap Toward Interactive Language Modelling Based on Caregiver-Child Interactions. The paper Towards Interactive Language Modelling is on arXiv. submitted by /u/Yuqing7
    What is DeepMind's Gopher?
    (0 min) submitted by /u/Beautiful-Credit-868
    Reinforcement learning for the real world
    (0 min) submitted by /u/bendee983
    SaneBox | Clean up your inbox in minutes & keep it that way forever
    (0 min) submitted by /u/belqassim
    Last Week in AI - AI enables brain interface for robot control, Deep Learning suffers from overinterpretation, and more!
    (0 min) submitted by /u/regalalgorithm
    Where can I find AIs which are able to edit images automatically?
    (1 min) Like filters or even better. submitted by /u/xXLisa28Xx
    We Are Miracles Made of the Cosmic Sea of Miracles
    (1 min) submitted by /u/sanguineon
    State of art in Knowledge representation of image and text pair in AI model
    (1 min) Suppose there is training data consisting of pairs of an image and a corresponding textual explanation of that image. How can this semantic knowledge between an image and its textual meaning be represented in an AI model, so that much less training data is needed once that knowledge is infused? The idea is to infuse this knowledge and then, at inference time, act when such an image appears. For example, if an image's associated meaning is 'open the door', then a robot seeing that image at inference time should open the door. I am thinking of an NLP-based model to infuse this knowledge, but I am not sure how a BERT-like model can be employed in this situation. Is there existing SOTA that addresses this kind of issue? submitted by /u/projekt_treadstone
    Researchers From Stanford and NVIDIA Introduce A Tri-Plane-Based 3D GAN Framework To Enable High-Resolution Geometry-Aware Image Synthesis
    (1 min) Generative Adversarial Networks (GANs) have been one of the main hypes of recent years. Based on the famous generator-discriminator mechanism, their very simple functioning has driven research to continuously improve the architecture. The peak in image generation has been reached by StyleGANs, which can produce astonishingly realistic and high-quality images, able to fool even humans. While the generation of new samples has achieved excellent results in the 2D domain, 3D GANs are still highly inefficient. If the exact mechanism of 2D GANs is applied in the 3D environment, the computational effort is too high, since 3D data is tough to manipulate on current GPUs. For this reason, research has focused on how to construct geometry-aware GANs that can infer the underlying 3D structure using only 2D images. But in this case, the approximations are usually not 3D-consistent. Paper: https://arxiv.org/pdf/2112.07945.pdf Project: https://matthew-a-chan.github.io/EG3D/ submitted by /u/ai-lover
  • Machine Learning

    [D] Turning petabyte-scale video data into great datasets
    (1 min) Imagine 10 cameras, running 24/7 at 30 FPS - that's roughly 26 million frames generated in a single day. Knowing what's in that data or finding the 1% of things that are actually interesting is hard. I genuinely think there's way too little attention paid to how people should use their raw data effectively, even though so many people choose to store petabytes of it, just in case. I wrote a little article about taking raw video and turning that into an actionable computer vision model. Would love to have a technical discussion about this so comment away :) https://towardsdatascience.com/curating-a-dataset-from-raw-images-and-videos-c8b962eca9ba P.S. Yes, I'm also the OP of the post showcasing Sieve from a few days ago! submitted by /u/happybirthday290
    [Project] Guidance on Key Information Extraction for financial statements
    (2 min) Hello all, I'm currently working on a project where I'm trying to perform key information extraction (KIE) and classification on rental property operating statements. My plan is to pre-train a model on a similar dataset (e.g. [IBM FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/), etc.) and then fine-tune on a custom dataset I'm creating. My task is to: Identify the key:value pairs associated with each line item for each period in the statement. So for example, for the first line item, we'd have a 3-tuple of ('Rent income', 'Nov 2020', 21,428.03) Identify the key:value pairs associated with the total for each line item in the statement. So for example, for the first line item, we'd have a 3-tuple of ('Rent income', 'Total', 322,872.36) Automatically classify line it…
    [D] Sparsest Cut in practice?
    (1 min) Sparsest cut, and its close variant "balanced cut", have been studied for ages in the computer science theory community, and there are algorithms to compute this which run fast. Sparsest cut is often billed as a "graph clustering" algorithm. Sparsest cut gives rise to an obvious data clustering algorithm: build a graph on your data (like a kNN or threshold graph), and run sparsest cut to get a 2-way partition. To do a k-way partition, you could do this recursively. From what I know, sparsest-cut intuition is one thing that guided the creation of spectral clustering, a widely used clustering method. Has anyone tried implementing this in practice? If so, how does it do on data? If not, why not? I have searched extensively on Google for any implementations of this sparsest-cut style clustering, but I haven't found any. Given that sparsest cut is one of the most well-studied problems in computer science theory, I was surprised that no implementation of this clustering method exists. submitted by /u/Grand_Distribution83
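    Exact sparsest cut is NP-hard, which is partly why off-the-shelf implementations are scarce; in practice the spectral relaxation the post alludes to is the usual stand-in. A sketch that splits on the sign of the Fiedler vector, a heuristic proxy rather than the O(sqrt(log n))-approximation algorithms from the theory literature:

    ```python
    import numpy as np
    import networkx as nx

    def fiedler_bipartition(G):
        """2-way partition from the sign of the Fiedler vector of the
        normalized Laplacian, a common proxy for sparsest/balanced cut."""
        L = nx.normalized_laplacian_matrix(G).toarray()
        vals, vecs = np.linalg.eigh(L)             # ascending eigenvalues
        fiedler = vecs[:, 1]                       # second-smallest eigenvalue's vector
        nodes = list(G.nodes())
        side_a = {n for n, v in zip(nodes, fiedler) if v >= 0}
        return side_a, set(nodes) - side_a
    ```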
    [D] Any measurement studies about MLaaS monetisation
    (1 min) ML as a Service appears to be the new hype, but I can't quite figure out how such technologies are monetised. GPT-3 appears to have attracted quite a lot of clients, but is it profitable at all? What do the incentives look like in this market? Are model extraction attacks a problem here at all? How does one protect their IP? I heard of people watermarking their models, but I am not sure how such techniques in their current form would fare in court. submitted by /u/SuchOccasion457
    [D] Yolov5 TTA slower on smaller image
    (1 min) Hi all, I'm using a Yolov5 model to make inferences and noticed that it was faster at inferring 320x320 images (~0.25s/image) compared to 288x288 images (~0.45s/image). When I disabled test time augmentation, the gap closed, and both image sizes are inferred at ~0.084s/image. Any ideas why augmenting a smaller image size is slower than augmenting a larger image size? submitted by /u/queue_learning
    [N] MDLI Ops – A free conference to help you make sense of the MLOps landscape
    (2 min) Disclaimer: Not my conference, but a good friend's work. He's doing an awesome job community building and I thought this might interest the community here as well (I am not a sponsor). Hey r/ML, some of you might have heard about MDLI – short for Machine & Deep Learning Israel. It's an independent Israeli community (over 25K members) for professionals in data science and ML, and they are having a conference for everyone (in English), in just 14 days. You can register here https://machinelearning.co.il/mdli-ops-2022/ I know that sometimes these free events tend to feel like commercials, but you should really check out the agenda. Some of the talks are going to be by ML teams at companies like AppsFlyer and BigPanda, explaining how they built their internal stacks and ML systems. There's also a super interesting talk by the NVIDIA team, where they will talk about building supercomputers to train ML models on them. This one seems crazy awesome to me. It's also a good opportunity to listen to offerings by some cool ML startups, which might give you a better understanding of how they compare, and what you actually care about when choosing MLOps tools. The organizers will probably be here in the comments if you have any questions, but in my opinion, every event this community organizes is really awesome, and I get to learn a lot, so I really recommend this. submitted by /u/PhYsIcS-GUY227
    [P] Machine Learning Engineering Conferences
    (1 min) I was wondering if anyone could recommend any conferences suitable for MLEs, maybe with a focus on deploying models. I've been to traditional data science focused ones, such as the Anaconda one, though I wanted to see if any existed that were more closely connected to deployment. submitted by /u/MenArePigs69 [link] [comments]
    [P] Deepchecks: an open-source tool for high-standard validation of ML models and data.
    (2 min) Hey everyone! I wanted to share with you an open-source tool we've been building for a while. Deepchecks is an open-source tool for validating & testing models and data efficiently: https://github.com/deepchecks/deepchecks Deepchecks is a python package, implementing validations and tests needed in order to trust an ML pipeline. It contains many built-in checks, such as verifying the data integrity, inspecting its distributions, validating data splits, evaluating your model, and comparing between different models. In addition, it contains test suites, similar to the test suites in software programs, that can accompany you through all building blocks of the ML pipeline development. Each test suite contains checks necessary for the specific part of the pipeline. The suite result is rendered as an interactive report (image omitted). The suites and checks have a simple syntax and are highly customizable. If you want to jump right in, you can try it out in the quick start notebook: https://docs.deepchecks.com/en/stable/examples/guides/quickstart_in_5_minutes.html What do you think? I'll be happy to hear your thoughts and feedback. submitted by /u/EuphoricMeal8344 [link] [comments]
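    A minimal quickstart sketch following the pattern in the linked docs; the toy dataset and sklearn model are stand-ins, and the Dataset/full_suite imports reflect the deepchecks 0.x quickstart and may have moved in later releases:

        from sklearn.datasets import load_iris
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split
        from deepchecks import Dataset
        from deepchecks.suites import full_suite

        # Fit a simple model on a toy dataset (stand-ins for a real pipeline).
        iris = load_iris(as_frame=True).frame
        train_df, test_df = train_test_split(iris, random_state=0)
        model = RandomForestClassifier(random_state=0)
        model.fit(train_df.drop(columns="target"), train_df["target"])

        # Wrap the splits and run the built-in full suite of checks.
        train_ds = Dataset(train_df, label="target", cat_features=[])
        test_ds = Dataset(test_df, label="target", cat_features=[])
        result = full_suite().run(train_dataset=train_ds, test_dataset=test_ds, model=model)
        result.save_as_html("report.html")  # assumption: available on suite results in 0.x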
    Relationship extraction for knowledge graph creation from biomedical literature
    (0 min) submitted by /u/dreadknight011 [link] [comments]
    [D] Retrieval transformers for other domains?
    (1 min) I am just going through the research behind this: http://jalammar.github.io/illustrated-retrieval-transformer/ and was wondering if there has been work like this for other domains, say computer vision, for example? submitted by /u/data-drone [link] [comments]
    [D] How to correctly use batch norm in pre-loaded architectures?
    (1 min) I have tried using a few pre-loaded architectures in Tensorflow including tf.keras.applications.efficientnet.EfficientNetB3 and tf.keras.applications.MobileNetV3Large . I am working on medical images so I am not using any pretrained ImageNet weights. I manage to reach a decent accuracy during the training epoch (eg 75%), but my validation accuracy doesn't cross random performance (eg 3%). Upon closer inspection I have identified the cause as being the batch normalization layers - if I run the evaluation batches w/ training=True on the BatchNorm layers, I am able to reproduce the training set accuracy. I have tried playing with the momentum parameter and changing it from its default value of .999 to .75 or even .5 but with no effect. My question is what may be causing this and how can I fix it? On a larger note, do most people find that they can use pre-loaded architectures out-of-the-box, or are there various parameters that they need to modify to get them to work properly? submitted by /u/rsandler [link] [comments]
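    For reference, a hedged sketch of the fix most often suggested for this symptom when training these backbones from scratch: lower the momentum of every BatchNorm layer so the moving statistics used at evaluation time track the batch statistics faster. The momentum value here is an arbitrary starting point, and assigning the attribute after construction works in most TF2 versions because the layer reads self.momentum at call time:

        import tensorflow as tf

        # Rebuild the backbone, then drop the momentum of every BatchNorm
        # layer so moving_mean/moving_var converge in fewer training steps.
        base = tf.keras.applications.MobileNetV3Large(
            include_top=False, weights=None, input_shape=(224, 224, 3))
        for layer in base.layers:
            if isinstance(layer, tf.keras.layers.BatchNormalization):
                layer.momentum = 0.9  # assumption: 0.9 is just a common starting point

    Larger batch sizes help for the same reason: the per-batch statistics that the moving averages are chasing become less noisy.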
  • Neural Networks, Deep Learning and Machine Learning

    Can a neural net be trained to create badass artwork from pinterest?
    (2 min) Example: If I trained a neural net on my own pinterest board full of manga/illustration style artwork - with a rough theme (fantasy/sci-fi/wizard type characters) - could the neural net then be able to output random variations of these? If not, what would it take? Sorry if I'm thinking about this the wrong way, I'm still new to this. submitted by /u/skittleteeth [link] [comments]
    Is this the right use of an ANN?
    (1 min) I have made rudimentary ANNs before but keep looking for ways to push my boundaries. I like AI and flight sims, so why not an AI that helps me flight sim. :) I'm thinking of an ANN assistant (eg co-pilot) that learns to do key actions during flight. Inputs: on ground (Boolean), altitude (numerical, normalised), pitch (nose up/down/level). Outputs: gear (up/down), autopilot (on/off), flaps (up/down), lights (on/off). So, given the state of the aircraft at any given point, the ANN can manipulate (or recommend) the configuration of the aircraft and lighten the pilot workload (remembering this is a fun exercise). Assume my ANN can monitor and manipulate the flight sim (it has an interface to do this); is an ANN with appropriate weights and nodes a good fit (after training)? Should I be looking at a Q-table or something else? submitted by /u/Togfox [link] [comments]
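    Treated as plain supervised learning from logged flights, this setup maps naturally onto a small multi-label network; a hedged sketch where the feature layout, sizes, and toy data are all assumptions rather than anything from the post:

        import numpy as np
        import tensorflow as tf

        # A small multi-label MLP "co-pilot": state features in, one
        # independent sigmoid per aircraft setting out.
        model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(3,)),               # on_ground, altitude, pitch
            tf.keras.layers.Dense(16, activation="relu"),
            tf.keras.layers.Dense(4, activation="sigmoid"),  # gear, autopilot, flaps, lights
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy")

        # Training data would be logged (state, configuration) pairs from flights.
        X = np.array([[1.0, 0.0, 0.5]])        # toy row: on ground, low, level-ish
        y = np.array([[1.0, 0.0, 1.0, 1.0]])   # gear down, AP off, flaps down, lights on
        model.fit(X, y, epochs=1, verbose=0)

    A Q-table / RL formulation only becomes necessary if the assistant must discover the actions itself from a reward signal rather than imitate logged pilot behaviour.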
    3dCNN question help appreciated
    (1 min) I'm new to 3dCNN. Is it possible to use 3dCNN for action classification (like sitting, walking)? I have temperature data in csv; this data can be turned into 16x16x4 images (RGB to show temp in different colors). My question is, do I need to turn each frame into an image, to then be turned into a video in order to feed it into the 3dCNN? If so, how can I optimize this? 1 action has several frames in the same csv file, and I have several files. Or can I just use the temperature data in csv straight away? (Each file contains one action with several frames worth of temp data.) Any help/links/examples are greatly appreciated. submitted by /u/Miki_mallow [link] [comments]
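    To answer the last question concretely: a 3D CNN only needs a (frames, height, width, channels) tensor, so the CSV values can be fed in directly without ever rendering images. A hedged sketch where the file name, the row-wise 16x16 layout, and the clip length are assumptions about the data:

        import numpy as np
        import pandas as pd

        # Skip the image-rendering step: stack raw temperature frames into
        # (clips, frames, 16, 16, 1) tensors for a 3D CNN.
        df = pd.read_csv("action_01.csv", header=None)       # one action per file
        frames = df.values.reshape(-1, 16, 16, 1).astype("float32")

        # Normalise temperatures and group frames into fixed-length clips.
        frames = (frames - frames.mean()) / (frames.std() + 1e-8)
        clip_len = 16
        n_clips = len(frames) // clip_len
        clips = frames[: n_clips * clip_len].reshape(n_clips, clip_len, 16, 16, 1)

    Rendering the temperatures to RGB would only re-encode the same scalar in three channels.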
  • Reinforcement Learning

    Help with my PyTorch implementation of PPO
    (1 min) Hi all, I implemented PPO using PyTorch here. As is suggested, I was trying it on a very simple environment (CartPole-v1). My PPO doesn't seem to be very impressive. The reward seems to saturate somewhere around 25, while the best it reaches is around the 190s. Here is a benchmark of all my experiments. There exist various implementations with various sets of hyperparameters and other improvement choices (advantage estimation with GAE, clipped critic loss, and so on). I have tried various mixes of them, but sadly nothing seems to improve it. The various resources I have referred to are: Phil's video: https://youtu.be/hlv79rcHws0 ; video by wandb: https://youtu.be/MEt6rrxH8W4 ; hyperparameters from Deep Reinforcement Learning that Matters: https://arxiv.org/pdf/1709.06560.pdf ; Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO: https://openreview.net/attachment?id=r1etN1rtPB&name=original_pdf . Can you help me get it trained on this simple environment? I have implemented it in a way that should be easy to run and understand. submitted by /u/harshraj22 [link] [comments]
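    For anyone sanity-checking an implementation against the post, the clipped surrogate at the heart of PPO is small enough to compare line by line; a minimal PyTorch sketch (not the poster's code):

        import torch

        def ppo_clip_loss(logp_new, logp_old, adv, clip_eps=0.2):
            # Probability ratio between the updated and the rollout policy.
            ratio = torch.exp(logp_new - logp_old)
            unclipped = ratio * adv
            clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
            # Maximise the clipped surrogate, i.e. minimise its negation.
            return -torch.min(unclipped, clipped).mean()

    Reward plateaus this early on CartPole often trace back to logp_old not being detached, advantages computed with the wrong sign, or the rollout buffer mixing episodes.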
    Normalizing advantage estimates in PPO
    (1 min) Hello everybody! One implementation detail of PPO is to normalize the advantage estimates. These are usually normalized using the mean and std so that the mean is 0 and the std is 1. Now, I came across 3 different implementations that apply the normalization to different amounts of data. More precisely: normalizing advantages across single episodes (kamarl), normalizing advantages across single mini batches (neroRL), and normalizing advantages across the entire batch (SpinningUp). Do you guys think that this detail has a big impact? My first intuition is that the more data is used for normalization, the smoother the normalized advantages are. So normalizing across single episodes could lead to high variance and harmfully large parameter updates, whereas the smoother batch-wide normalization could in turn hinder learning. submitted by /u/LilHairdy [link] [comments]
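    A minimal sketch of the three normalization scopes the post lists (toy tensors; the repository names above are where each variant was reportedly observed):

        import torch

        def normalize(adv: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
            # Standardise advantages to zero mean and unit variance.
            return (adv - adv.mean()) / (adv.std() + eps)

        episode_advs = [torch.randn(100), torch.randn(80)]   # toy rollouts
        batch = torch.cat(episode_advs)
        mb_idx = torch.randperm(len(batch))[:64]

        per_episode = torch.cat([normalize(a) for a in episode_advs])  # scope 1
        per_minibatch = normalize(batch[mb_idx])                       # scope 2
        per_batch = normalize(batch)                                   # scope 3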
    Would you fine tune the hyperparameters or modify the reward shaping (customized env) to improve the performance of model-free algorithms like TD3?
    (1 min) submitted by /u/Rich_Beautiful [link] [comments]
    Is Reinforcement Learning just a special case of causal modeling?
    (6 min) Hi everyone - I am a data scientist who works a lot on causal inference problems. In particular, I build individual treatment effect (AKA Conditional Average Treatment Effects AKA Uplift) models to estimate the causal effect of some potential treatment(s) on an outcome of interest for an individual for the express purpose of giving the optimal "treatment" to a given individual. "Optimal" usually means the treatment that is estimated to cause the largest positive effect on the outcome. A typical example is optimizing marketing touches for individual customers: we have several "treatments" we can apply to a customer at a given time, such as targeting them for an online ad campaign, sending a push notification through our app, sending a marketing email, etc. and we want to make sure we are pu…
    Analyzing Skill Expression in Games (with the help of reinforcement learning)
    (17 min) submitted by /u/thechiamp [link] [comments]

2022-01-05

  • cs.LG updates on arXiv.org

    Identify comorbidities associated with recurrent ED and in-patient visits. (arXiv:2110.13769v2 [stat.ML] UPDATED)
    (0 min) In the hospital setting, a small percentage of recurrent frequent patients contribute to a disproportionate amount of healthcare resource usage. Moreover, in many of these cases, patient outcomes can be greatly improved by reducing reoccurring visits, especially when they are associated with substance abuse, mental health, and medical factors that could be improved by social-behavioral interventions, outpatient or preventative care. Additionally, health care costs can be reduced significantly with fewer preventable recurrent visits. To address this, we developed a computationally efficient and interpretable framework that both identifies recurrent patients with high utilization and determines which comorbidities contribute most to their recurrent visits. Specifically, we present a novel algorithm, called minimum similarity association rules (MSAR), balancing the confidence-support trade-off, to determine the conditions most associated with reoccurring Emergency Department (ED) and inpatient visits. We validate MSAR on a large Electronic Health Record (EHR) dataset.
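    For readers outside the association-rule literature, the two quantities the confidence-support trade-off refers to are standard; a toy sketch (the transactions and condition names are invented for illustration, not from the paper):

        # Support and confidence for an association rule A -> B.
        def support(transactions, itemset):
            return sum(itemset <= t for t in transactions) / len(transactions)

        def confidence(transactions, A, B):
            return support(transactions, A | B) / support(transactions, A)

        tx = [{"copd", "ed_visit"}, {"asthma"}, {"copd", "ed_visit"}, {"copd"}]
        print(support(tx, {"copd"}))                   # 0.75
        print(confidence(tx, {"copd"}, {"ed_visit"}))  # 2/3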
    GCN-Geo: A Graph Convolution Network-based Fine-grained IP Geolocation System. (arXiv:2112.10767v4 [cs.LG] UPDATED)
    (0 min) Fine-grained IP geolocation systems often rely on some linear delay-distance rules. They are not easy to generalize in network environments where the delay-distance relationship is non-linear. Recently, researchers have begun to pay attention to learning-based IP geolocation systems. These data-driven algorithms leverage multi-layer perceptrons (MLP) to model non-linear relationships. However, MLP is not so suitable for modeling computer networks because networks are fundamentally graph-typed data. MLP-based IP geolocation systems treat IP addresses only as isolated data instances, forgoing the connection information between IP addresses. This would lead to sub-optimal representations and limit the geolocation performance. Graph convolutional network (GCN) is an emerging deep learning method for graph-typed data representation. In this work, we research how to model computer networks for fine-grained IP geolocation with GCN. First, we formulate the IP geolocation task as an attributed graph node regression problem. Then, a GCN-based IP geolocation system named GCN-Geo is proposed to predict the location of each IP address. GCN-Geo consists of a preprocessor, an encoder, graph convolutional (GC) layers and a decoder. The preprocessor and the encoder transform raw measurement data into the initial graph embeddings. GC layers refine the initial graph node embeddings by explicitly modeling the connection information between IP addresses. The proposed decoder can relieve the converging problem of GCN-Geo by considering some prior knowledge about target IP addresses. Finally, the experimental results on three real-world datasets show that GCN-Geo clearly outperforms the state-of-the-art rule-based and learning-based baselines on all three datasets w.r.t. average, median and max error distances. This work verifies the potential of GCN in fine-grained IP geolocation.
    A Comprehensive Survey of Logging in Software: From Logging Statements Automation to Log Mining and Analysis. (arXiv:2110.12489v2 [cs.SE] UPDATED)
    (0 min) Logs are widely used to record runtime information of software systems, such as the timestamp and the importance of an event, the unique ID of the source of the log, and a part of the state of a task's execution. The rich information of logs enables system developers (and operators) to monitor the runtime behaviors of their systems and further track down system problems and perform analysis on log data in production settings. However, prior research on utilizing logs is scattered, which limits the ability of new researchers in this field to quickly get up to speed and hampers currently active researchers in advancing the field further. Therefore, this paper surveys and provides a systematic literature review and mapping of contemporary logging practices and log statements' mining and monitoring techniques, and their applications, such as system failure detection and diagnosis. We study a large number of conference and journal papers that appeared in top-level peer-reviewed venues. Additionally, we draw high-level trends of ongoing research and categorize publications into subdivisions. In the end, and based on our holistic observations during this survey, we provide a set of challenges and opportunities that will guide researchers in academia and industry in moving the field forward.
    Fracture Detection in Wrist X-ray Images Using Deep Learning-Based Object Detection Models. (arXiv:2111.07355v2 [eess.IV] UPDATED)
    (0 min) Wrist fractures are common cases in hospitals, particularly in emergency services. Physicians need images from various medical devices, and patients' medical history and physical examination, to diagnose these fractures correctly and apply proper treatment. This study aims to perform fracture detection using deep learning on wrist X-ray images to assist physicians not specialized in the field, working in emergency services in particular, in the diagnosis of fractures. For this purpose, 20 different detection procedures were performed using deep learning based object detection models on a dataset of wrist X-ray images obtained from Gazi University Hospital. DCN, Dynamic R_CNN, Faster R_CNN, FSAF, Libra R_CNN, PAA, RetinaNet, RegNet and SABL deep learning based object detection models with various backbones were used herein. To further improve detection procedures in the study, 5 different ensemble models were developed, which were later used to reform an ensemble model to develop a detection model unique to our study, titled wrist fracture detection combo (WFD_C). Based on 26 different models for fracture detection, the highest detection result was 0.8639 average precision (AP50), achieved by the WFD_C model. This study is supported by Huawei Turkey R&D Center within the scope of the ongoing cooperation project coded 071813 among Gazi University, Huawei and Medskor.
    SoK: Privacy-preserving Deep Learning with Homomorphic Encryption. (arXiv:2112.12855v2 [cs.CR] UPDATED)
    (0 min) Outsourced computation for neural networks allows users access to state-of-the-art models without needing to invest in specialized hardware and know-how. The problem is that the users lose control over potentially privacy-sensitive data. With homomorphic encryption (HE), computation can be performed on encrypted data without revealing its content. In this systematization of knowledge, we take an in-depth look at approaches that combine neural networks with HE for privacy preservation. We categorize the changes to neural network models and architectures to make them computable over HE and how these changes impact performance. We find numerous challenges to HE-based privacy-preserving deep learning, such as computational overhead, usability, and limitations posed by the encryption schemes.
    Mimicking the Oracle: An Initial Phase Decorrelation Approach for Class Incremental Learning. (arXiv:2112.04731v3 [cs.CV] UPDATED)
    (0 min) Class Incremental Learning (CIL) aims at learning a multi-class classifier in a phase-by-phase manner, in which only data of a subset of the classes are provided at each phase. Previous works mainly focus on mitigating forgetting in phases after the initial one. However, we find that improving CIL at its initial phase is also a promising direction. Specifically, we experimentally show that directly encouraging the CIL learner at the initial phase to output similar representations as the model jointly trained on all classes can greatly boost the CIL performance. Motivated by this, we study the difference between a naïvely-trained initial-phase model and the oracle model. Specifically, since one major difference between these two models is the number of training classes, we investigate how such difference affects the model representations. We find that, with fewer training classes, the data representations of each class lie in a long and narrow region; with more training classes, the representations of each class scatter more uniformly. Inspired by this observation, we propose Class-wise Decorrelation (CwD) that effectively regularizes representations of each class to scatter more uniformly, thus mimicking the model jointly trained with all classes (i.e., the oracle model). Our CwD is simple to implement and easy to plug into existing methods. Extensive experiments on various benchmark datasets show that CwD consistently and significantly improves the performance of existing state-of-the-art methods by around 1% to 3%. Code will be released.
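    The decorrelation idea is concrete enough to sketch; below is a hedged guess at the flavor of penalty being described (a per-class correlation-matrix regularizer), not the paper's exact CwD formulation:

        import torch

        def decorrelation_penalty(feats: torch.Tensor) -> torch.Tensor:
            # feats: (N, D) embeddings of a single class. Penalise
            # off-diagonal correlations so the class scatters more uniformly.
            z = feats - feats.mean(dim=0, keepdim=True)
            z = z / (z.std(dim=0, keepdim=True) + 1e-8)
            corr = (z.T @ z) / feats.shape[0]            # (D, D) correlation matrix
            off_diag = corr - torch.diag(torch.diag(corr))
            return off_diag.pow(2).sum() / feats.shape[1] ** 2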
    DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning. (arXiv:2106.03760v3 [cs.LG] UPDATED)
    (0 min) The Mixture-of-Experts (MoE) architecture is showing promising results in improving parameter sharing in multi-task learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable sparse gate to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: a continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation. The gate can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select. We demonstrate the effectiveness of DSelect-k on both synthetic and real MTL datasets with up to $128$ tasks. Our experiments indicate that DSelect-k can achieve statistically significant improvements in prediction and expert selection over popular MoE gates. Notably, on a real-world, large-scale recommender system, DSelect-k achieves over $22\%$ improvement in predictive performance compared to Top-k. We provide an open-source implementation of DSelect-k.
    On the Provable Generalization of Recurrent Neural Networks. (arXiv:2109.14142v3 [cs.LG] UPDATED)
    (0 min) Recurrent Neural Network (RNN) is a fundamental structure in deep learning. Recently, some works study the training process of over-parameterized neural networks, and show that over-parameterized networks can learn functions in some notable concept classes with a provable generalization error bound. In this paper, we analyze the training and generalization for RNNs with random initialization, and provide the following improvements over recent works: 1) For an RNN with input sequence $x=(X_1,X_2,...,X_L)$, previous works study learning functions that are summations of $f(\beta^T_lX_l)$ and require normalization conditions that $||X_l||\leq\epsilon$ with some very small $\epsilon$ depending on the complexity of $f$. In this paper, using detailed analysis about the neural tangent kernel matrix, we prove a generalization error bound to learn such functions without normalization conditions and show that some notable concept classes are learnable with the numbers of iterations and samples scaling almost-polynomially in the input length $L$. 2) Moreover, we prove a novel result to learn N-variable functions of the input sequence with the form $f(\beta^T[X_{l_1},...,X_{l_N}])$, which do not belong to the "additive" concept class, i.e., the summation of functions $f(X_l)$. We show that when either $N$ or $l_0=\max(l_1,..,l_N)-\min(l_1,..,l_N)$ is small, $f(\beta^T[X_{l_1},...,X_{l_N}])$ will be learnable with the number of iterations and samples scaling almost-polynomially in the input length $L$.
    Layer-wise Adaptive Model Aggregation for Scalable Federated Learning. (arXiv:2110.10302v3 [cs.LG] UPDATED)
    (0 min) In Federated Learning, a common approach for aggregating local models across clients is periodic averaging of the full model parameters. It is, however, known that different layers of neural networks can have a different degree of model discrepancy across the clients. The conventional full aggregation scheme does not consider such a difference and synchronizes the whole model parameters at once, resulting in inefficient network bandwidth consumption. Aggregating the parameters that are similar across the clients does not make meaningful training progress while increasing the communication cost. We propose FedLAMA, a layer-wise model aggregation scheme for scalable Federated Learning. FedLAMA adaptively adjusts the aggregation interval in a layer-wise manner, jointly considering the model discrepancy and the communication cost. The layer-wise aggregation method makes it possible to finely control the aggregation interval and relax the aggregation frequency without a significant impact on model accuracy. Our empirical study shows that FedLAMA reduces the communication cost by up to 60% for IID data and 70% for non-IID data while achieving a comparable accuracy to FedAvg.
    On Optimizing Interventions in Shared Autonomy. (arXiv:2112.09169v2 [cs.AI] UPDATED)
    (0 min) Shared autonomy refers to approaches for enabling an autonomous agent to collaborate with a human with the aim of improving human performance. However, besides improving performance, it may often also be beneficial that the agent concurrently accounts for preserving the user's experience or satisfaction of collaboration. In order to address this additional goal, we examine approaches for improving the user experience by constraining the number of interventions by the autonomous agent. We propose two model-free reinforcement learning methods that can account for both hard and soft constraints on the number of interventions. We show that not only does our method outperform the existing baseline, but also eliminates the need to manually tune a black-box hyperparameter for controlling the level of assistance. We also provide an in-depth analysis of intervention scenarios in order to further illuminate system understanding.
    User-Level Label Leakage from Gradients in Federated Learning. (arXiv:2105.09369v4 [cs.CR] UPDATED)
    (0 min) Federated learning enables multiple users to build a joint model by sharing their model updates (gradients), while their raw data remains local on their devices. In contrast to the common belief that this provides privacy benefits, we here add to the very recent results on privacy risks when sharing gradients. Specifically, we investigate Label Leakage from Gradients (LLG), a novel attack to extract the labels of the users' training data from their shared gradients. The attack exploits the direction and magnitude of gradients to determine the presence or absence of any label. LLG is simple yet effective, capable of leaking potential sensitive information represented by labels, and scales well to arbitrary batch sizes and multiple classes. We mathematically and empirically demonstrate the validity of the attack under different settings. Moreover, empirical results show that LLG successfully extracts labels with high accuracy at the early stages of model training. We also discuss different defense mechanisms against such leakage. Our findings suggest that gradient compression is a practical technique to mitigate the attack.
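    The "direction and magnitude of gradients" signal the attack exploits has a well-known minimal form; a hedged toy sketch of the underlying last-layer property (not the full LLG attack), assuming cross-entropy loss and non-negative post-ReLU features:

        import torch
        import torch.nn.functional as F

        # With cross-entropy and non-negative features, the final layer's
        # gradient rows for labels present in the batch tend to be negative.
        W = torch.zeros(10, 32, requires_grad=True)  # untrained last layer
        feats = torch.rand(4, 32)                    # non-negative activations
        labels = torch.tensor([2, 2, 7, 1])
        loss = F.cross_entropy(feats @ W.T, labels)
        loss.backward()
        leaked = (W.grad.sum(dim=1) < 0).nonzero().flatten().tolist()
        print(leaked)  # for this toy setup, recovers [1, 2, 7]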
    Provably Efficient Reinforcement Learning with Linear Function Approximation Under Adaptivity Constraints. (arXiv:2101.02195v2 [cs.LG] UPDATED)
    (0 min) We study reinforcement learning (RL) with linear function approximation under the adaptivity constraint. We consider two popular limited adaptivity models: the batch learning model and the rare policy switch model, and propose two efficient online RL algorithms for episodic linear Markov decision processes, where the transition probability and the reward function can be represented as a linear function of some known feature mapping. Specifically, for the batch learning model, our proposed LSVI-UCB-Batch algorithm achieves an $\tilde O(\sqrt{d^3H^3T} + dHT/B)$ regret, where $d$ is the dimension of the feature mapping, $H$ is the episode length, $T$ is the number of interactions and $B$ is the number of batches. Our result suggests that it suffices to use only $\sqrt{T/dH}$ batches to obtain $\tilde O(\sqrt{d^3H^3T})$ regret. For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an $\tilde O(\sqrt{d^3H^3T[1+T/(dH)]^{dH/B}})$ regret, which implies that $dH\log T$ policy switches suffice to obtain the $\tilde O(\sqrt{d^3H^3T})$ regret. Our algorithms achieve the same regret as the LSVI-UCB algorithm (Jin et al., 2019), yet with a substantially smaller amount of adaptivity. We also establish a lower bound for the batch learning model, which suggests that the dependency on $B$ in our regret bound is tight.
    BM-NAS: Bilevel Multimodal Neural Architecture Search. (arXiv:2104.09379v2 [cs.CV] UPDATED)
    (0 min) Deep neural networks (DNNs) have shown superior performances on various multimodal learning problems. However, it often requires huge efforts to adapt DNNs to individual multimodal tasks by manually engineering unimodal features and designing multimodal feature fusion strategies. This paper proposes Bilevel Multimodal Neural Architecture Search (BM-NAS) framework, which makes the architecture of multimodal fusion models fully searchable via a bilevel searching scheme. At the upper level, BM-NAS selects the inter/intra-modal feature pairs from the pretrained unimodal backbones. At the lower level, BM-NAS learns the fusion strategy for each feature pair, which is a combination of predefined primitive operations. The primitive operations are elaborately designed and they can be flexibly combined to accommodate various effective feature fusion modules such as multi-head attention (Transformer) and Attention on Attention (AoA). Experimental results on three multimodal tasks demonstrate the effectiveness and efficiency of the proposed BM-NAS framework. BM-NAS achieves competitive performances with much less search time and fewer model parameters in comparison with the existing generalized multimodal NAS methods.
    Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs. (arXiv:2010.00587v3 [cs.LG] UPDATED)
    (0 min) We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-$\gamma$, which is based on the \emph{optimism in the face of uncertainty principle} and the Bernstein-type bonus. We show that UCBVI-$\gamma$ achieves an $\tilde{O}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tilde{\Omega}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$\gamma$ is nearly minimax optimal for discounted MDPs.
    NCVX: A User-Friendly and Scalable Package for Nonconvex Optimization in Machine Learning. (arXiv:2111.13984v2 [cs.LG] UPDATED)
    (0 min) Optimizing nonconvex (NCVX) problems, especially nonsmooth and constrained ones, is an essential part of machine learning. However, it can be hard to reliably solve such problems without optimization expertise. Existing general-purpose NCVX optimization packages are powerful but typically cannot handle nonsmoothness. GRANSO is among the first optimization solvers targeting general nonsmooth NCVX problems with nonsmooth constraints, but, as it is implemented in MATLAB and requires the user to provide analytical gradients, GRANSO is often not a convenient choice in machine learning (especially deep learning) applications. To greatly lower the technical barrier, we introduce a new software package called NCVX, whose initial release contains the solver PyGRANSO, a PyTorch-enabled port of GRANSO incorporating auto-differentiation, GPU acceleration, tensor input, and support for new QP solvers. NCVX is built on freely available and widely used open-source frameworks, and as a highlight, can solve general constrained deep learning problems, the first of its kind. NCVX is available at https://ncvx.org, with detailed documentation and numerous examples from machine learning and other fields.
    Graph Pointer Neural Networks. (arXiv:2110.00973v2 [cs.LG] UPDATED)
    (0 min) Graph Neural Networks (GNNs) have shown advantages in various graph-based applications. Most existing GNNs assume strong homophily of graph structure and apply permutation-invariant local aggregation of neighbors to learn a representation for each node. However, they fail to generalize to heterophilic graphs, where most neighboring nodes have different labels or features, and the relevant nodes are distant. Few recent studies attempt to address this problem by combining multiple hops of hidden representations of central nodes (i.e., multi-hop-based approaches) or sorting the neighboring nodes based on attention scores (i.e., ranking-based approaches). As a result, these approaches have some apparent limitations. On the one hand, multi-hop-based approaches do not explicitly distinguish relevant nodes from a large number of multi-hop neighborhoods, leading to a severe over-smoothing problem. On the other hand, ranking-based models do not joint-optimize node ranking with end tasks and result in sub-optimal solutions. In this work, we present Graph Pointer Neural Networks (GPNN) to tackle the challenges mentioned above. We leverage a pointer network to select the most relevant nodes from a large amount of multi-hop neighborhoods, which constructs an ordered sequence according to the relationship with the central node. 1D convolution is then applied to extract high-level features from the node sequence. The pointer-network-based ranker in GPNN is joint-optimized with other parts in an end-to-end manner. Extensive experiments are conducted on six public node classification datasets with heterophilic graphs. The results show that GPNN significantly improves the classification performance of state-of-the-art methods. In addition, analyses also reveal the privilege of the proposed GPNN in filtering out irrelevant neighbors and reducing over-smoothing.
    A Survey of Generalisation in Deep Reinforcement Learning. (arXiv:2111.09794v3 [cs.LG] UPDATED)
    (0 min) The study of generalisation in deep Reinforcement Learning (RL) aims to produce RL algorithms whose policies generalise well to novel unseen situations at deployment time, avoiding overfitting to their training environments. Tackling this is vital if we are to deploy reinforcement learning algorithms in real world scenarios, where the environment will be diverse, dynamic and unpredictable. This survey is an overview of this nascent field. We provide a unifying formalism and terminology for discussing different generalisation problems, building upon previous works. We go on to categorise existing benchmarks for generalisation, as well as current methods for tackling the generalisation problem. Finally, we provide a critical discussion of the current state of the field, including recommendations for future work. Among other conclusions, we argue that taking a purely procedural content generation approach to benchmark design is not conducive to progress in generalisation, we suggest fast online adaptation and tackling RL-specific problems as some areas for future work on methods for generalisation, and we recommend building benchmarks in underexplored problem settings such as offline RL generalisation and reward-function variation.
    Sample Complexity of Learning Parametric Quantum Circuits. (arXiv:2107.09078v2 [quant-ph] UPDATED)
    (0 min) Quantum computers hold unprecedented potentials for machine learning applications. Here, we prove that physical quantum circuits are PAC (probably approximately correct) learnable on a quantum computer via empirical risk minimization: to learn a parametric quantum circuit with at most $n^c$ gates and each gate acting on a constant number of qubits, the sample complexity is bounded by $\tilde{O}(n^{c+1})$. In particular, we explicitly construct a family of variational quantum circuits with $O(n^{c+1})$ elementary gates arranged in a fixed pattern, which can represent all physical quantum circuits consisting of at most $n^c$ elementary gates. Our results provide a valuable guide for quantum machine learning in both theory and practice.
    High-dimensional Bayesian Optimization Algorithm with Recurrent Neural Network for Disease Control Models in Time Series. (arXiv:2201.00147v1 [cs.LG])
    (0 min) The Bayesian Optimization algorithm has become a promising approach for nonlinear global optimization problems and many machine learning applications. Over the past few years, improvements and enhancements have been brought forward and have shown promising results in solving complex dynamic problems, such as systems of ordinary differential equations where the objective functions are computationally expensive to evaluate. However, the straightforward implementation of the Bayesian Optimization algorithm performs well merely for optimization problems with 10-20 dimensions. The study presented in this paper proposes a new high-dimensional Bayesian Optimization algorithm combining recurrent neural networks, which is expected to predict the optimal solution for global optimization problems with high-dimensional or time-series decision models. The proposed RNN-BO algorithm solves the optimal control problem in a lower-dimensional space and then uses a recurrent neural network to learn from the historical optimal solution data and predict the optimal control strategy for any new initial system value setting. In addition, accurately and quickly providing the optimal control strategy is essential to effectively and efficiently control epidemic spread while minimizing the associated financial costs. Therefore, to verify the effectiveness of the proposed algorithm, computational experiments are carried out on a deterministic SEIR epidemic model and a stochastic SIS optimal control model. Finally, we also discuss the impact of different numbers of RNN layers and training epochs on the trade-off between solution quality and the related computational effort.
    FUSeg: The Foot Ulcer Segmentation Challenge. (arXiv:2201.00414v1 [eess.IV])
    (0 min) Acute and chronic wounds with varying etiologies burden healthcare systems economically. The advanced wound care market is estimated to reach $22 billion by 2024. Wound care professionals provide proper diagnosis and treatment with heavy reliance on images and image documentation. Segmentation of wound boundaries in images is a key component of the care and diagnosis protocol since it is important to estimate the area of the wound and provide quantitative measurement for the treatment. Unfortunately, this process is very time-consuming and requires a high level of expertise. Recently, automatic wound segmentation methods based on deep learning have shown promising performance, but they require large datasets for training and it is unclear which methods perform better. To address these issues, we propose the Foot Ulcer Segmentation challenge (FUSeg), organized in conjunction with the 2021 International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI). We built a wound image dataset containing 1,210 foot ulcer images collected over 2 years from 889 patients. It is pixel-wise annotated by wound care experts and split into a training set with 1010 images and a testing set with 200 images for evaluation. Teams around the world developed automated methods to predict wound segmentations on the testing set, whose annotations were kept private. The predictions were evaluated and ranked based on the average Dice coefficient. The FUSeg challenge remains an open challenge as a benchmark for wound segmentation after the conference.
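    For reference, the Dice coefficient used for ranking is twice the overlap of the predicted and ground-truth masks divided by their total foreground; a minimal sketch with toy masks:

        import numpy as np

        def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
            # 2 * |pred AND gt| / (|pred| + |gt|)
            pred, gt = pred.astype(bool), gt.astype(bool)
            inter = np.logical_and(pred, gt).sum()
            return 2.0 * inter / (pred.sum() + gt.sum() + eps)

        a = np.array([[1, 1, 0], [0, 1, 0]])
        b = np.array([[1, 0, 0], [0, 1, 1]])
        print(dice(a, b))  # 2*2 / (3+3) = 0.667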
    Deep Representation Learning with an Information-theoretic Loss. (arXiv:2111.12950v3 [cs.LG] UPDATED)
    (0 min) This paper proposes an information-theoretic loss for learning deep neural networks. We propose a loss function based on the Information Bottleneck principle and a max-margin loss with an aim to increase the class separability in the embedded space. While deep neural network models have excelled in supervised learning tasks with large-scale labeled data available, they are prone to practical issues when testing samples fall outside of the classes shown in training, e.g., anomaly detection and out-of-distribution detection. In such tasks, it is not sufficient to merely discriminate between known classes. Our intuition is to represent the known classes in compact and separated embedded regions in order to decrease the possibility of known and unseen classes largely overlapping in the embedded space. We show that the IB-based loss function reflects the inter-class distances as well as the compactness within classes, thus extending the existing deep data description models. Our empirical study shows that the proposed model improves the segmentation of normal classes in the deep feature space, which contributes to identifying out-of-distribution samples.
    Rank-1 Similarity Matrix Decomposition For Modeling Changes in Antivirus Consensus Through Time. (arXiv:2201.00757v1 [cs.CR])
    (0 min) Although groups of strongly correlated antivirus engines are known to exist, at present there is limited understanding of how or why these correlations came to be. Using a corpus of 25 million VirusTotal reports representing over a decade of antivirus scan data, we challenge prevailing wisdom that these correlations primarily originate from "first-order" interactions such as antivirus vendors copying the labels of leading vendors. We introduce the Temporal Rank-1 Similarity Matrix decomposition (R1SM-T) in order to investigate the origins of these correlations and to model how consensus amongst antivirus engines changes over time. We reveal that first-order interactions do not explain as much behavior in antivirus correlation as previously thought, and that the relationships between antivirus engines are highly volatile. We make recommendations on items in need of future study and consideration based on our findings.
    Feature matching as improved transfer learning technique for wearable EEG. (arXiv:2201.00644v1 [eess.SP])
    (0 min) Objective: With the rapid rise of wearable sleep monitoring devices with non-conventional electrode configurations, there is a need for automated algorithms that can perform sleep staging on configurations with small amounts of labeled data. Transfer learning has the ability to adapt neural network weights from a source modality (e.g. standard electrode configuration) to a new target modality (e.g. non-conventional electrode configuration). Methods: We propose feature matching, a new transfer learning strategy as an alternative to the commonly used finetuning approach. This method consists of training a model with larger amounts of data from the source modality and few paired samples of source and target modality. For those paired samples, the model extracts features of the target modality, matching these to the features from the corresponding samples of the source modality. Results: We compare feature matching to finetuning for three different target domains, with two different neural network architectures, and with varying amounts of training data. Particularly on small cohorts (i.e. 2 - 5 labeled recordings in the non-conventional recording setting), feature matching systematically outperforms finetuning with mean relative differences in accuracy ranging from 0.4% to 4.7% for the different scenarios and datasets. Conclusion: Our findings suggest that feature matching outperforms finetuning as a transfer learning approach, especially in very low data regimes. Significance: As such, we conclude that feature matching is a promising new method for wearable sleep staging with novel devices.
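    For the paired samples, the feature-matching strategy reduces to a simple training-time objective; a hedged sketch with stand-in encoders, where all shapes and module choices are assumptions rather than the paper's architecture:

        import torch
        import torch.nn as nn

        # Stand-in encoders for the two modalities.
        src_enc = nn.Linear(128, 64)   # trained on plentiful source-modality data
        tgt_enc = nn.Linear(64, 64)    # the new wearable-modality encoder

        x_src = torch.randn(8, 128)    # paired sleep epochs, source modality
        x_tgt = torch.randn(8, 64)     # the same epochs, target modality

        # Match the target features to the frozen source features of the
        # corresponding samples, instead of finetuning on target labels.
        with torch.no_grad():
            f_src = src_enc(x_src)
        loss = nn.functional.mse_loss(tgt_enc(x_tgt), f_src)
        loss.backward()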
    Deep Learning Applications for Lung Cancer Diagnosis: A systematic review. (arXiv:2201.00227v1 [eess.IV])
    (0 min) Lung cancer has been one of the most prevalent diseases in recent years. According to research in this field, more than 200,000 cases are identified each year in the US. Uncontrolled multiplication and growth of the lung cells result in malignant tumour formation. Recently, deep learning algorithms, especially Convolutional Neural Networks (CNN), have become a superior way to automatically diagnose disease. The purpose of this article is to review different models that lead to different accuracy and sensitivity in the diagnosis of early-stage lung cancer and to help physicians and researchers in this field. The main purpose of this work is to identify the challenges that exist in lung cancer diagnosis based on deep learning. The survey is systematically written, combining systematic mapping and a literature review of 32 conference and journal articles in the field from 2016 to 2021. After analysing and reviewing the articles, the questions raised in them are answered. This research is superior to other review articles in this field due to its complete review of relevant articles and systematic write-up.
    Generalization Bounds for Noisy Iterative Algorithms Using Properties of Additive Noise Channels. (arXiv:2102.02976v3 [stat.ML] UPDATED)
    (0 min) Machine learning models trained by different optimization algorithms under different data distributions can exhibit distinct generalization behaviors. In this paper, we analyze the generalization of models trained by noisy iterative algorithms. We derive distribution-dependent generalization bounds by connecting noisy iterative algorithms to additive noise channels found in communication and information theory. Our generalization bounds shed light on several applications, including differentially private stochastic gradient descent (DP-SGD), federated learning, and stochastic gradient Langevin dynamics (SGLD). We demonstrate our bounds through numerical experiments, showing that they can help understand recent empirical observations of the generalization phenomena of neural networks.
    Provable Meta-Learning of Linear Representations. (arXiv:2002.11684v5 [cs.LG] UPDATED)
    (0 min) Meta-learning, or learning-to-learn, seeks to design algorithms that can utilize previous experience to rapidly learn new skills or adapt to new environments. Representation learning -- a key tool for performing meta-learning -- learns a data representation that can transfer knowledge across multiple tasks, which is essential in regimes where data is scarce. Despite a recent surge of interest in the practice of meta-learning, the theoretical underpinnings of meta-learning algorithms are lacking, especially in the context of learning transferable representations. In this paper, we focus on the problem of multi-task linear regression -- in which multiple linear regression models share a common, low-dimensional linear representation. Here, we provide provably fast, sample-efficient algorithms to address the dual challenges of (1) learning a common set of features from multiple, related tasks, and (2) transferring this knowledge to new, unseen tasks. Both are central to the general problem of meta-learning. Finally, we complement these results by providing information-theoretic lower bounds on the sample complexity of learning these linear features.
    Audiogmenter: a MATLAB Toolbox for Audio Data Augmentation. (arXiv:1912.05472v4 [eess.AS] UPDATED)
    (0 min) Audio data augmentation is a key step in training deep neural networks for solving audio classification tasks. In this paper, we introduce Audiogmenter, a novel audio data augmentation library in MATLAB. We provide 15 different augmentation algorithms for raw audio data and 8 for spectrograms. We efficiently implemented several augmentation techniques whose usefulness has been extensively proved in the literature. To the best of our knowledge, this is the largest MATLAB audio data augmentation library freely available. We validate the efficiency of our algorithms evaluating them on the ESC-50 dataset. The toolbox and its documentation can be downloaded at https://github.com/LorisNanni/Audiogmenter.
    VDPC: Variational Density Peak Clustering Algorithm. (arXiv:2201.00641v1 [cs.LG])
    (0 min) The widely applied density peak clustering (DPC) algorithm makes an intuitive cluster formation assumption that cluster centers are often surrounded by data points with lower local density and far away from other data points with higher local density. However, this assumption suffers from the limitation that it is often problematic when identifying clusters with lower density, because they might be easily merged into other clusters with higher density. As a result, DPC may not be able to identify clusters with variational density. To address this issue, we propose a variational density peak clustering (VDPC) algorithm, which is designed to systematically and autonomously perform the clustering task on datasets with various types of density distributions. Specifically, we first propose a novel method to identify the representatives among all data points and construct initial clusters based on the identified representatives for further analysis of the clusters' property. Furthermore, we divide all data points into different levels according to their local density and propose a unified clustering framework by combining the advantages of both DPC and DBSCAN. Thus, all the identified initial clusters spreading across different density levels are systematically processed to form the final clusters. To evaluate the effectiveness of the proposed VDPC algorithm, we conduct extensive experiments using 20 datasets including eight synthetic, six real-world and six image datasets. The experimental results show that VDPC outperforms two classical algorithms (i.e., DPC and DBSCAN) and four state-of-the-art extended DPC algorithms.
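    For context on what DPC-family methods compute, a hedged sketch of the two per-point quantities the classic density-peak assumption is stated in terms of (the cutoff distance and toy data are illustrative):

        import numpy as np
        from scipy.spatial.distance import cdist

        def dpc_scores(X: np.ndarray, dc: float):
            # rho: local density = number of neighbours within cutoff dc.
            # delta: distance to the nearest point of strictly higher density.
            d = cdist(X, X)
            rho = (d < dc).sum(axis=1) - 1          # exclude the point itself
            delta = np.full(len(X), d.max())
            for i in range(len(X)):
                higher = np.where(rho > rho[i])[0]
                if len(higher):
                    delta[i] = d[i, higher].min()
            return rho, delta

        X = np.random.default_rng(0).normal(size=(200, 2))
        rho, delta = dpc_scores(X, dc=0.3)
        centers = np.argsort(rho * delta)[-2:]      # large rho AND large delta

    VDPC's complaint is precisely that this rho-delta picture favours high-density regions, so low-density clusters tend to get absorbed.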
    SLURP: Side Learning Uncertainty for Regression Problems. (arXiv:2110.11182v2 [cs.CV] UPDATED)
    (0 min) It has become critical for deep learning algorithms to quantify their output uncertainties to satisfy reliability constraints and provide accurate results. Uncertainty estimation for regression has received less attention than classification due to the more straightforward standardized output of the latter class of tasks and their high importance. However, regression problems are encountered in a wide range of applications in computer vision. We propose SLURP, a generic approach for regression uncertainty estimation via a side learner that exploits the output and the intermediate representations generated by the main task model. We test SLURP on two critical regression tasks in computer vision: monocular depth and optical flow estimation. In addition, we conduct exhaustive benchmarks comprising transfer to different datasets and the addition of aleatoric noise. The results show that our proposal is generic and readily applicable to various regression problems and has a low computational cost with respect to existing solutions.
    Continuous Submodular Maximization: Boosting via Non-oblivious Function. (arXiv:2201.00703v1 [cs.LG])
    (0 min) In this paper, we revisit the constrained and stochastic continuous submodular maximization in both offline and online settings. For each $\gamma$-weakly DR-submodular function $f$, we use the factor-revealing optimization equation to derive an optimal auxiliary function $F$, whose stationary points provide a $(1-e^{-\gamma})$-approximation to the global maximum value (denoted as $OPT$) of problem $\max_{\boldsymbol{x}\in\mathcal{C}}f(\boldsymbol{x})$. Naturally, the projected (mirror) gradient ascent relying on this non-oblivious function achieves $(1-e^{-\gamma}-\epsilon^{2})OPT-\epsilon$ after $O(1/\epsilon^{2})$ iterations, beating the traditional $(\frac{\gamma^{2}}{1+\gamma^{2}})$-approximation gradient ascent \citep{hassani2017gradient} for submodular maximization. Similarly, based on $F$, the classical Frank-Wolfe algorithm equipped with variance reduction technique \citep{mokhtari2018conditional} also returns a solution with objective value larger than $(1-e^{-\gamma}-\epsilon^{2})OPT-\epsilon$ after $O(1/\epsilon^{3})$ iterations. In the online setting, we first consider the adversarial delays for stochastic gradient feedback, under which we propose a boosting online gradient algorithm with the same non-oblivious search, achieving a regret of $\sqrt{D}$ (where $D$ is the sum of delays of gradient feedback) against a $(1-e^{-\gamma})$-approximation to the best feasible solution in hindsight. Finally, extensive numerical experiments demonstrate the efficiency of our boosting methods.
    Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space. (arXiv:2201.00814v1 [cs.CV])
    (0 min) This paper explores the feasibility of finding an optimal sub-model from a vision transformer and introduces a pure vision transformer slimming (ViT-Slim) framework that can search such a sub-structure from the original model end-to-end across multiple dimensions, including the input tokens, MHSA and MLP modules with state-of-the-art performance. Our method is based on a learnable and unified l1 sparsity constraint with pre-defined factors to reflect the global importance in the continuous searching space of different dimensions. The searching process is highly efficient through a single-shot training scheme. For instance, on DeiT-S, ViT-Slim only takes ~43 GPU hours for the searching process, and the searched structure is flexible with diverse dimensionalities in different modules. Then, a budget threshold is employed according to the requirements of the accuracy-FLOPs trade-off on running devices, and a re-training process is performed to obtain the final models. The extensive experiments show that our ViT-Slim can compress up to 40% of parameters and 40% FLOPs on various vision transformers while increasing the accuracy by ~0.6% on ImageNet. We also demonstrate the advantage of our searched models on several downstream datasets. Our source code will be publicly available.
    Detecting Misclassification Errors in Neural Networks with a Gaussian Process Model. (arXiv:2010.02065v4 [cs.LG] UPDATED)
    (0 min) As neural network classifiers are deployed in real-world applications, it is crucial that their failures can be detected reliably. One practical solution is to assign confidence scores to each prediction, then use these scores to filter out possible misclassifications. However, existing confidence metrics are not yet sufficiently reliable for this role. This paper presents a new framework that produces a quantitative metric for detecting misclassification errors. This framework, RED, builds an error detector on top of the base classifier and estimates uncertainty of the detection scores using Gaussian Processes. Experimental comparisons with other error detection methods on 125 UCI datasets demonstrate that this approach is effective. Further implementations on two probabilistic base classifiers and two large deep learning architectures in vision tasks further confirm that the method is robust and scalable. Finally, an empirical analysis of RED with out-of-distribution and adversarial samples shows that the method can be used not only to detect errors but also to understand where they come from. RED can thereby be used to improve trustworthiness of neural network classifiers more broadly in the future.
    Machine learning moment closure models for the radiative transfer equation I: directly learning a gradient based closure. (arXiv:2105.05690v2 [math.NA] UPDATED)
    (0 min) In this paper, we take a data-driven approach and apply machine learning to the moment closure problem for the radiative transfer equation in slab geometry. Instead of learning the unclosed high order moment, we propose to directly learn the gradient of the high order moment using neural networks. This new approach is consistent with the exact closure we derive for the free streaming limit and also provides a natural output normalization. A variety of benchmark tests, including the variable scattering problem, the Gaussian source problem with both periodic and reflecting boundaries, and the two-material problem, show both good accuracy and generalizability of our machine learning closure model.
    Estimating Causal Effects of Multi-Aspect Online Reviews with Multi-Modal Proxies. (arXiv:2112.10274v2 [cs.LG] UPDATED)
    (0 min) Online reviews enable consumers to engage with companies and provide important feedback. Due to the complexity of the high-dimensional text, these reviews are often simplified as a single numerical score, e.g., ratings or sentiment scores. This work empirically examines the causal effects of user-generated online reviews on a granular level: we consider multiple aspects, e.g., the Food and Service of a restaurant. Understanding consumers' opinions toward different aspects can help evaluate business performance in detail and strategize business operations effectively. Specifically, we aim to answer interventional questions such as What will the restaurant popularity be if the quality w.r.t. its aspect Service is increased by 10%? The defining challenge of causal inference with observational data is the presence of "confounders", which might not be observed or measured, e.g., consumers' preference to food type, rendering the estimated effects biased and high-variance. To address this challenge, we have recourse to multi-modal proxies such as the consumer profile information and interactions between consumers and businesses. We show how to effectively leverage the rich information to identify and estimate causal effects of multiple aspects embedded in online reviews. Empirical evaluations on synthetic and real-world data corroborate the efficacy and shed light on the actionable insight of the proposed approach.
    An Attention Score Based Attacker for Black-box NLP Classifier. (arXiv:2112.11660v2 [cs.LG] UPDATED)
    (0 min) Deep neural networks have a wide range of applications in solving various real-world tasks and have achieved satisfactory results, in domains such as computer vision, image classification, and natural language processing. Meanwhile, the security and robustness of neural networks have become imperative, as diverse research has shown the vulnerable aspects of neural networks. Case in point, in natural language processing tasks, a neural network may be fooled by an attentively modified text, which has a high similarity to the original one. As per previous research, most studies focus on the image domain; different from image adversarial attacks, text is represented as a discrete sequence, so traditional image attack methods are not applicable in the NLP field. In this paper, we propose a word-level NLP sentiment classifier attack model, which includes a self-attention mechanism-based word selection method and a greedy search algorithm for word substitution. We experiment with our attack model by attacking GRU and 1D-CNN victim models on IMDB datasets. Experimental results demonstrate that our model achieves a higher attack success rate and is more efficient than previous methods, because efficient word selection algorithms are employed and the number of word substitutions is minimized. Also, our model is transferable and can be used in the image domain with several modifications.
    Execute Order 66: Targeted Data Poisoning for Reinforcement Learning. (arXiv:2201.00762v1 [cs.LG])
    (0 min) Data poisoning for reinforcement learning has historically focused on general performance degradation, and targeted attacks have been successful via perturbations that involve control of the victim's policy and rewards. We introduce an insidious poisoning attack for reinforcement learning which causes agent misbehavior only at specific target states - all while minimally modifying a small fraction of training observations without assuming any control over policy or reward. We accomplish this by adapting a recent technique, gradient alignment, to reinforcement learning. We test our method and demonstrate success in two Atari games of varying difficulty.
    Online Regularization towards Always-Valid High-Dimensional Dynamic Pricing. (arXiv:2007.02470v2 [stat.ML] UPDATED)
    (0 min) Devising a dynamic pricing policy with an always-valid online statistical learning procedure is an important and as yet unresolved problem. Most existing dynamic pricing policies, which focus on the faithfulness of the adopted customer choice models, exhibit a limited capability for adapting to the online uncertainty of the learned statistical model during the pricing process. In this paper, we propose a novel approach for designing dynamic pricing policies based on regularized online statistical learning with theoretical guarantees. The new approach overcomes the challenge of continuously monitoring the online Lasso procedure and possesses several appealing properties. In particular, we make the decisive observation that the always-validity of pricing decisions builds and thrives on the online regularization scheme. Our proposed online regularization scheme equips the proposed optimistic online regularized maximum likelihood pricing (OORMLP) policy with three major advantages: it encodes market noise knowledge into the pricing process's optimism; it empowers online statistical learning with always-validity over all decision points; and it envelops the prediction error process with time-uniform non-asymptotic oracle inequalities. This type of non-asymptotic inference result allows us to design more sample-efficient and robust dynamic pricing algorithms in practice. In theory, the proposed OORMLP algorithm exploits the sparsity structure of high-dimensional models and secures a logarithmic regret over the decision horizon. These theoretical advances are made possible by proposing an optimistic online Lasso procedure that resolves dynamic pricing problems at the process level, based on a novel use of non-asymptotic martingale concentration. In experiments, we evaluate OORMLP in different synthetic and real pricing problem settings and demonstrate that OORMLP advances the state-of-the-art methods.
    Randomized Signature Layers for Signal Extraction in Time Series Data. (arXiv:2201.00384v1 [cs.LG])
    (0 min) Time series analysis is a widespread task in Natural Sciences, Social Sciences, and Engineering. A fundamental problem is finding an expressive yet efficient-to-compute representation of the input time series to use as a starting point to perform arbitrary downstream tasks. In this paper, we build upon recent works that use the Signature of a path as a feature map and investigate a computationally efficient technique to approximate these features based on linear random projections. We present several theoretical results to justify our approach and empirically validate that our random projections can effectively retrieve the underlying Signature of a path. We show the surprising performance of the proposed random features on several tasks, including (1) mapping the controls of stochastic differential equations to the corresponding solutions and (2) using the Randomized Signatures as time series representations for classification tasks. When compared to corresponding truncated Signature approaches, our Randomized Signatures are more computationally efficient in high dimensions and often lead to better accuracy and faster training. Besides providing a new tool to extract Signatures and further validating the high level of expressiveness of such features, we believe our results provide interesting conceptual links between several existing research areas, suggesting new intriguing directions for future investigations.
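    As a rough sketch of the flavor of such random features (one common construction from the randomized-signature literature; the paper's exact recipe may differ): a random state is evolved by the path increments through fixed random maps, and the final state serves as the feature vector.

        import numpy as np

        rng = np.random.default_rng(0)

        def randomized_signature(path, k=64):
            # Evolve a random k-dimensional state driven by the path increments
            # through fixed random projections (a sketch, assumptions ours).
            T, d = path.shape
            A = rng.normal(size=(d, k, k)) / np.sqrt(k)
            b = rng.normal(size=(d, k))
            z = rng.normal(size=k)
            for t in range(T - 1):
                dx = path[t + 1] - path[t]
                for i in range(d):
                    z = z + np.tanh(A[i] @ z + b[i]) * dx[i]
            return z

        path = np.cumsum(rng.normal(size=(100, 3)), axis=0)  # toy 3-dim path
        features = randomized_signature(path)                # feature vector
        print(features.shape)                                # (64,)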
    Neural combinatorial optimization beyond the TSP: Existing architectures under-represent graph structure. (arXiv:2201.00668v1 [cs.AI])
    (0 min) Recent years have witnessed the promise that reinforcement learning, coupled with Graph Neural Network (GNN) architectures, could learn to solve hard combinatorial optimization problems: given raw input data and an evaluator to guide the process, the idea is to automatically learn a policy able to return feasible and high-quality outputs. Recent works have shown promising results, but they were mainly evaluated on the travelling salesman problem (TSP) and similar abstract variants such as the Split Delivery Vehicle Routing Problem (SDVRP). In this paper, we analyze how and whether recent neural architectures can be applied to graph problems of practical importance. We thus set out to systematically "transfer" these architectures to the Power and Channel Allocation Problem (PCAP), which has practical relevance for, e.g., radio resource allocation in wireless networks. Our experimental results suggest that existing architectures (i) are still incapable of capturing graph structural features and (ii) are not suitable for problems where the actions on the graph change the graph attributes. On a positive note, we show that augmenting the structural representation of problems with Distance Encoding is a promising step towards the still-ambitious goal of learning multi-purpose autonomous solvers.
    Momentum Contrastive Voxel-wise Representation Learning for Semi-supervised Volumetric Medical Image Segmentation. (arXiv:2105.07059v3 [cs.CV] UPDATED)
    (0 min) Automated segmentation in medical image analysis is a challenging task that requires a large amount of manually labeled data. However, manually annotating medical data is often laborious, and most existing learning-based approaches fail to accurately delineate object boundaries without effective geometric constraints. Contrastive learning, a sub-area of self-supervised learning, has recently been noted as a promising direction in multiple application fields. In this work, we present a novel Contrastive Voxel-wise Representation Distillation (CVRD) method with geometric constraints to learn global-local visual representations for volumetric medical image segmentation with limited annotations. Our framework can effectively learn global and local features by capturing 3D spatial context and rich anatomical information. Specifically, we introduce a voxel-to-volume contrastive algorithm to learn global information from 3D images, and propose to perform local voxel-to-voxel distillation to explicitly make use of local cues in the embedding space. Moreover, we integrate an elastic interaction-based active contour model as a geometric regularization term to enable fast and reliable object delineation in an end-to-end learning manner. Results on the Atrial Segmentation Challenge dataset demonstrate the superiority of our proposed scheme, especially in settings with a very limited number of annotations. The code will be available at https://github.com/charlesyou999648/CVRD.
    Robust Contrastive Learning Using Negative Samples with Diminished Semantics. (arXiv:2110.14189v2 [cs.CV] UPDATED)
    (0 min) Unsupervised learning has recently made exceptional progress because of the development of more effective contrastive learning methods. However, CNNs are prone to depend on low-level features that humans deem non-semantic. This dependency has been conjectured to induce a lack of robustness to image perturbations or domain shift. In this paper, we show that by generating carefully designed negative samples, contrastive learning can learn more robust representations with less dependence on such features. Contrastive learning utilizes positive pairs that preserve semantic information while perturbing superficial features in the training images. Similarly, we propose to generate negative samples in a reversed way, where only the superfluous instead of the semantic features are preserved. We develop two methods, texture-based and patch-based augmentations, to generate negative samples. These samples achieve better generalization, especially under out-of-domain settings. We also analyze our method and the generated texture-based samples, showing that texture features are indispensable in classifying particular ImageNet classes and especially finer classes. We also show that model bias favors texture and shape features differently under different test settings. Our code, trained models, and ImageNet-Texture dataset can be found at https://github.com/SongweiGe/Contrastive-Learning-with-Non-Semantic-Negatives.
    Snowflake: Scaling GNNs to High-Dimensional Continuous Control via Parameter Freezing. (arXiv:2103.01009v3 [cs.LG] UPDATED)
    (0 min) Recent research has shown that graph neural networks (GNNs) can learn policies for locomotion control that are as effective as a typical multi-layer perceptron (MLP), with superior transfer and multi-task performance (Wang et al., 2018; Huang et al., 2020). Results have so far been limited to training on small agents, with the performance of GNNs deteriorating rapidly as the number of sensors and actuators grows. A key motivation for the use of GNNs in the supervised learning setting is their applicability to large graphs, but this benefit has not yet been realised for locomotion control. We identify a weakness in a common GNN architecture that causes this poor scaling: overfitting in the MLPs within the network that encode, decode, and propagate messages. To combat this, we introduce Snowflake, a GNN training method for high-dimensional continuous control that freezes parameters in parts of the network that suffer from overfitting. Snowflake significantly boosts the performance of GNNs for locomotion control on large agents, now matching the performance of MLPs, and with superior transfer properties.
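    The freezing mechanism itself is simple; a minimal sketch under our own simplifying assumptions (a toy layer, not the paper's architecture):

        import torch.nn as nn

        class TinyGNNLayer(nn.Module):
            # Toy stand-in for a GNN layer with message/update MLPs.
            def __init__(self, dim=32):
                super().__init__()
                self.message_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
                self.update_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

        layer = TinyGNNLayer()
        for p in layer.message_mlp.parameters():
            p.requires_grad = False  # frozen: excluded from gradient updates

        trainable = [p for p in layer.parameters() if p.requires_grad]
        print(sum(p.numel() for p in trainable), "trainable parameters remain")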
    Rethinking the Implementation Tricks and Monotonicity Constraint in Cooperative Multi-Agent Reinforcement Learning. (arXiv:2102.03479v18 [cs.LG] UPDATED)
    (0 min) Many complex multi-agent systems, such as robot swarm control and autonomous vehicle coordination, can be modeled as Multi-Agent Reinforcement Learning (MARL) tasks. QMIX, a widely popular MARL algorithm, has been used as a baseline for benchmark environments, e.g., the StarCraft Multi-Agent Challenge (SMAC) and Difficulty-Enhanced Predator-Prey (DEPP). Recent variants of QMIX target relaxing the monotonicity constraint of QMIX, allowing for performance improvement in SMAC. In this paper, we investigate the code-level optimizations of these variants and the monotonicity constraint. (1) We find that such improvements of the variants are significantly affected by various code-level optimizations. (2) The experiment results show that QMIX with normalized optimizations outperforms other works in SMAC; (3) beyond the common wisdom from these works, the monotonicity constraint can improve sample efficiency in SMAC and DEPP. We also discuss why monotonicity constraints work well in purely cooperative tasks with a theoretical analysis. We open-source the code at \url{https://github.com/hijkzzz/pymarl2}.
    A Cluster-Based Trip Prediction Graph Neural Network Model for Bike Sharing Systems. (arXiv:2201.00720v1 [cs.LG])
    (0 min) Bike Sharing Systems (BSSs) are emerging as an innovative transportation service. Ensuring the proper functioning of a BSS is crucial, given that these systems are committed to addressing many current global concerns by promoting environmental and economic sustainability and contributing to improving the quality of life of the population. Good knowledge of users' transition patterns is a decisive contribution to the quality and operability of the service. Unbalanced user transition patterns cause these systems to suffer from bicycle imbalance, leading to drastic customer loss in the long term. Strategies for bicycle rebalancing become important to tackle this problem, and for this, bicycle traffic prediction is essential, as it allows operators to act more efficiently and to react in advance. In this work, we propose a bicycle trip predictor based on Graph Neural Network embeddings, taking into consideration station groupings, meteorological conditions, geographical distances, and trip patterns. We evaluated our approach on the New York City BSS (CitiBike) data and compared it with four baselines, including the non-clustered approach. To address our problem's specificities, we developed the Adaptive Transition Constraint Clustering Plus (AdaTC+) algorithm, eliminating the shortcomings of previous work. Our experiments evidence the pertinence of clustering (88% accuracy compared with 83% without clustering) and which clustering technique best suits this problem. Accuracy on the link prediction task is always higher for AdaTC+ than for benchmark clustering methods when the stations are the same, and performance does not degrade when the network is upgraded and thus mismatched with the trained model.
    AutoDES: AutoML Pipeline Generation of Classification with Dynamic Ensemble Strategy Selection. (arXiv:2201.00207v1 [cs.LG])
    (0 min) Automating machine learning has achieved remarkable technological developments in recent years, and building an automated machine learning pipeline is now an essential task. The model ensemble is the technique of combining multiple models to obtain a better and more robust model. However, existing automated machine learning tends to be simplistic in handling the model ensemble, with a fixed ensemble strategy such as stacked generalization. Many techniques have been developed for different ensemble methods, especially ensemble selection, and a fixed ensemble strategy caps the model's achievable performance. In this article, we present a novel framework for automated machine learning. Our framework incorporates advances in dynamic ensemble selection, and to the best of our knowledge, our approach is the first in the field of AutoML to search and optimize ensemble strategies. In the comparison experiments, our method outperforms state-of-the-art automated machine learning frameworks under the same CPU time on 42 classification datasets from the OpenML platform. Ablation experiments on our framework validate the effectiveness of our proposed method.
    Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement. (arXiv:2003.13917v2 [eess.AS] UPDATED)
    (0 min) Recent studies have highlighted adversarial examples as ubiquitous threats to deep neural network (DNN) based speech recognition systems. In this work, we present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals. Specifically, we evaluate the model performance with interpretable speech recognition metrics and analyze the effect of augmented adversarial training. Our experiments show that our proposed U-Net$_{At}$ improves the perceptual evaluation of speech quality (PESQ) from 1.13 to 2.78, the speech transmission index (STI) from 0.65 to 0.75, and the short-term objective intelligibility (STOI) from 0.83 to 0.96 on the task of speech enhancement with adversarial speech examples. We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks. We find that (i) temporal features learned by the attention network are capable of enhancing the robustness of DNN based ASR models; (ii) the generalization power of DNN based ASR models can be enhanced by applying adversarial training with additive adversarial data augmentation. The ASR metric of word error rate (WER) shows an absolute 2.22% decrease under gradient-based perturbation and an absolute 2.03% decrease under evolutionary-optimized perturbation, which suggests that our enhancement models with adversarial training can further secure a resilient ASR system.
    A Near-Optimal Algorithm for Debiasing Trained Machine Learning Models. (arXiv:2106.12887v2 [cs.LG] UPDATED)
    (0 min) We present a scalable post-processing algorithm for debiasing trained models, including deep neural networks (DNNs), which we prove to be near-optimal by bounding its excess Bayes risk. We empirically validate its advantages on standard benchmark datasets across both classical algorithms as well as modern DNN architectures and demonstrate that it outperforms previous post-processing methods while performing on par with in-processing. In addition, we show that the proposed algorithm is particularly effective for models trained at scale where post-processing is a natural and practical choice.
    DDPG car-following model with real-world human driving experience in CARLA. (arXiv:2112.14602v1 [cs.RO] CROSS LISTED)
    (0 min) In the autonomous driving field, the fusion of human knowledge into Deep Reinforcement Learning (DRL) is often based on human demonstrations recorded in a simulated environment. This limits the generalization and feasibility of application in real-world traffic. We propose a two-stage DRL method that learns from real-world human driving to achieve performance superior to that of a pure DRL agent. The DRL agent is trained within a framework for CARLA with the Robot Operating System (ROS). For evaluation, we designed different real-world driving scenarios to compare the proposed two-stage DRL agent with the pure DRL agent. After extracting 'good' behavior from the human driver, such as anticipation at a signalized intersection, the agent becomes more efficient and drives more safely, making it better adapted to Human-Robot Interaction (HRI) traffic.
    An Embedding of ReLU Networks and an Analysis of their Identifiability. (arXiv:2107.09370v2 [cs.LG] UPDATED)
    (0 min) Neural networks with the Rectified Linear Unit (ReLU) nonlinearity are described by a vector of parameters $\theta$, and realized as a piecewise linear continuous function $R_{\theta}: x \in \mathbb R^{d} \mapsto R_{\theta}(x) \in \mathbb R^{k}$. Natural scalings and permutations operations on the parameters $\theta$ leave the realization unchanged, leading to equivalence classes of parameters that yield the same realization. These considerations in turn lead to the notion of identifiability -- the ability to recover (the equivalence class of) $\theta$ from the sole knowledge of its realization $R_{\theta}$. The overall objective of this paper is to introduce an embedding for ReLU neural networks of any depth, $\Phi(\theta)$, that is invariant to scalings and that provides a locally linear parameterization of the realization of the network. Leveraging these two key properties, we derive some conditions under which a deep ReLU network is indeed locally identifiable from the knowledge of the realization on a finite set of samples $x_{i} \in \mathbb R^{d}$. We study the shallow case in more depth, establishing necessary and sufficient conditions for the network to be identifiable from a bounded subset $\mathcal X \subseteq \mathbb R^{d}$.
    The Flip Side of the Reweighted Coin: Duality of Adaptive Dropout and Regularization. (arXiv:2106.07769v3 [cs.LG] UPDATED)
    (0 min) Among the most successful methods for sparsifying deep (neural) networks are those that adaptively mask the network weights throughout training. By examining this masking, or dropout, in the linear case, we uncover a duality between such adaptive methods and regularization through the so-called "$\eta$-trick" that casts both as iteratively reweighted optimizations. We show that any dropout strategy that adapts to the weights in a monotonic way corresponds to an effective subquadratic regularization penalty, and therefore leads to sparse solutions. We obtain the effective penalties for several popular sparsification strategies, which are remarkably similar to classical penalties commonly used in sparse optimization. Considering variational dropout as a case study, we demonstrate similar empirical behavior between the adaptive dropout method and classical methods on the task of deep network sparsification, validating our theory.
    AHAR: Adaptive CNN for Energy-efficient Human Activity Recognition in Low-power Edge Devices. (arXiv:2102.01875v3 [cs.LG] UPDATED)
    (0 min) Human Activity Recognition (HAR) is one of the key applications of health monitoring that requires continuous use of wearable devices to track daily activities. This paper proposes an Adaptive CNN for energy-efficient HAR (AHAR) suitable for low-power edge devices. Unlike traditional early-exit architectures that make the exit decision based on classification confidence, AHAR proposes a novel adaptive architecture that uses an output block predictor to select a portion of the baseline architecture to use during the inference phase. Experimental results show that traditional early-exit architectures suffer from performance loss, whereas our adaptive architecture provides similar or better performance than the baseline while being energy-efficient. We validate our methodology in classifying locomotion activities from two datasets, Opportunity and w-HAR. Compared to fog/cloud computing approaches for the Opportunity dataset, our baseline and adaptive architectures show comparable weighted F1 scores of 91.79% and 91.57%, respectively. For the w-HAR dataset, our baseline and adaptive architectures outperform state-of-the-art works with weighted F1 scores of 97.55% and 97.64%, respectively. Evaluation on real hardware shows that our baseline architecture is significantly more energy-efficient (422.38x less energy) and memory-efficient (14.29x less memory) than the works on the Opportunity dataset. For the w-HAR dataset, our baseline architecture requires 2.04x less energy and 2.18x less memory than the state-of-the-art work. Moreover, experimental results show that our adaptive architecture is 12.32% (Opportunity) and 11.14% (w-HAR) more energy-efficient than our baseline while providing similar (Opportunity) or better (w-HAR) performance with no significant memory overhead.
    DeepSight: Mitigating Backdoor Attacks in Federated Learning Through Deep Model Inspection. (arXiv:2201.00763v1 [cs.CR])
    (0 min) Federated Learning (FL) allows multiple clients to collaboratively train a Neural Network (NN) model on their private data without revealing the data. Recently, several targeted poisoning attacks against FL have been introduced. These attacks inject a backdoor into the resulting model that allows adversary-controlled inputs to be misclassified. Existing countermeasures against backdoor attacks are inefficient and often merely aim to exclude deviating models from the aggregation. However, this approach also removes benign models of clients with deviating data distributions, causing the aggregated model to perform poorly for such clients. To address this problem, we propose DeepSight, a novel model filtering approach for mitigating backdoor attacks. It is based on three novel techniques that make it possible to characterize the distribution of data used to train model updates and to measure fine-grained differences in the internal structure and outputs of NNs. Using these techniques, DeepSight can identify suspicious model updates. We also develop a scheme that can accurately cluster model updates. Combining the results of both components, DeepSight is able to identify and eliminate model clusters containing poisoned models with high attack impact. We also show that the backdoor contributions of possibly undetected poisoned models can be effectively mitigated with existing weight clipping-based defenses. We evaluate the performance and effectiveness of DeepSight and show that it can mitigate state-of-the-art backdoor attacks with a negligible impact on the model's performance on benign data.
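    To illustrate just the weight-clipping component mentioned at the end (not DeepSight's filtering pipeline), a minimal sketch with an assumed median-norm bound:

        import numpy as np

        def clip_updates(updates, clip_norm=None):
            # Scale each client update so its L2 norm does not exceed a bound;
            # here the bound defaults to the median of the observed norms.
            norms = np.array([np.linalg.norm(u) for u in updates])
            bound = np.median(norms) if clip_norm is None else clip_norm
            return [u * min(1.0, bound / (n + 1e-12)) for u, n in zip(updates, norms)]

        rng = np.random.default_rng(1)
        updates = [rng.normal(size=10) for _ in range(9)]
        updates.append(rng.normal(size=10) * 50.0)      # an outsized, suspicious update
        clipped = clip_updates(updates)
        print(max(np.linalg.norm(u) for u in clipped))  # bounded by the median norm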
    Class-Incremental Continual Learning into the eXtended DER-verse. (arXiv:2201.00766v1 [cs.LG])
    (0 min) The staple of human intelligence is the capability of acquiring knowledge in a continuous fashion. In stark contrast, Deep Networks forget catastrophically and, for this reason, the sub-field of Class-Incremental Continual Learning fosters methods that learn a sequence of tasks incrementally, blending sequentially-gained knowledge into a comprehensive prediction. This work aims at assessing and overcoming the pitfalls of our previous proposal Dark Experience Replay (DER), a simple and effective approach that combines rehearsal and Knowledge Distillation. Inspired by the way our minds constantly rewrite past recollections and set expectations for the future, we endow our model with the abilities to i) revise its replay memory to welcome novel information regarding past data ii) pave the way for learning yet unseen classes. We show that the application of these strategies leads to remarkable improvements; indeed, the resulting method - termed eXtended-DER (X-DER) - outperforms the state of the art on both standard benchmarks (such as CIFAR-100 and miniImagenet) and a novel one here introduced. To gain a better understanding, we further provide extensive ablation studies that corroborate and extend the findings of our previous research (e.g. the value of Knowledge Distillation and flatter minima in continual learning setups).
    Optimization-Based Algebraic Multigrid Coarsening Using Reinforcement Learning. (arXiv:2106.01854v2 [cs.LG] UPDATED)
    (0 min) Large sparse linear systems of equations are ubiquitous in science and engineering, such as those arising from discretizations of partial differential equations. Algebraic multigrid (AMG) methods are one of the most common methods of solving such linear systems, with an extensive body of underlying mathematical theory. A system of linear equations defines a graph on the set of unknowns and each level of a multigrid solver requires the selection of an appropriate coarse graph along with restriction and interpolation operators that map to and from the coarse representation. The efficiency of the multigrid solver depends critically on this selection and many selection methods have been developed over the years. Recently, it has been demonstrated that it is possible to directly learn the AMG interpolation and restriction operators, given a coarse graph selection. In this paper, we consider the complementary problem of learning to coarsen graphs for a multigrid solver, a necessary step in developing fully learnable AMG methods. We propose a method using a reinforcement learning (RL) agent based on graph neural networks (GNNs), which can learn to perform graph coarsening on small planar training graphs and then be applied to unstructured large planar graphs, assuming bounded node degree. We demonstrate that this method can produce better coarse graphs than existing algorithms, even as the graph size increases and other properties of the graph are varied. We also propose an efficient inference procedure for performing graph coarsening that results in linear time complexity in graph size.
    A Comprehensive Survey on Radio Frequency (RF) Fingerprinting: Traditional Approaches, Deep Learning, and Open Challenges. (arXiv:2201.00680v1 [cs.LG])
    (0 min) Fifth generation (5G) networks and beyond envision a massive Internet of Things (IoT) rollout to support disruptive applications such as extended reality (XR), augmented/virtual reality (AR/VR), industrial automation, autonomous driving, and smart everything, which together bring massive and diverse IoT devices into the radio frequency (RF) spectrum. Along with spectrum crunch and throughput challenges, such a massive scale of wireless devices exposes unprecedented threat surfaces. RF fingerprinting is heralded as a candidate technology that can be combined with cryptographic and zero-trust security measures to ensure data privacy, confidentiality, and integrity in wireless networks. Motivated by the relevance of this subject in future communication networks, in this work we present a comprehensive survey of RF fingerprinting approaches, ranging from a traditional view to the most recent deep learning (DL) based algorithms. Existing surveys have mostly offered a constrained presentation of wireless fingerprinting approaches, leaving many aspects untold. In this work, we mitigate this by addressing every aspect - background on signal intelligence (SIGINT), applications, relevant DL algorithms, a systematic literature review of RF fingerprinting techniques spanning the past two decades, a discussion on datasets, and potential research avenues - necessary to elucidate this topic to the reader in an encyclopedic manner.
    Improved Topic modeling in Twitter through Community Pooling. (arXiv:2201.00690v1 [cs.IR])
    (0 min) Social networks play a fundamental role in the propagation of information and news. Characterizing the content of messages becomes vital for different tasks, such as breaking news detection, personalized message recommendation, fake user detection, and information flow characterization. However, Twitter posts are short and often less coherent than other text documents, which makes it challenging to apply text mining algorithms to these datasets efficiently. Tweet pooling (aggregating tweets into longer documents) has been shown to improve automatic topic decomposition, but the performance achieved in this task varies depending on the pooling method. In this paper, we propose a new pooling scheme for topic modeling in Twitter, which groups tweets whose authors belong to the same community (a group of users who mainly interact with each other but not with other groups) on a user interaction graph. We present a complete evaluation of this methodology, state-of-the-art schemes, and previous pooling models in terms of cluster quality, document retrieval performance, and supervised machine learning classification scores. Results show that our community pooling method outperformed other methods on the majority of metrics in two heterogeneous datasets, while also reducing the running time. This is useful when dealing with large amounts of noisy and short user-generated social media texts. Overall, our findings contribute to an improved methodology for identifying the latent topics in a Twitter dataset, without the need to modify the basic machinery of a topic decomposition model.
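    A minimal sketch of the pooling step (toy graph and tweets; greedy modularity is our stand-in community detector, and a real pipeline would feed the pooled pseudo-documents to a topic model such as LDA):

        import networkx as nx
        from networkx.algorithms.community import greedy_modularity_communities

        # Toy user-interaction graph and per-user tweets (all invented data).
        G = nx.Graph([("a", "b"), ("b", "c"), ("d", "e"), ("e", "f")])
        tweets = {"a": ["cats are great"], "b": ["i love cats"], "c": ["cat pics"],
                  "d": ["election news"], "e": ["vote today"], "f": ["poll results"]}

        communities = greedy_modularity_communities(G)
        pooled_docs = [" ".join(t for user in comm for t in tweets[user])
                       for comm in communities]
        print(pooled_docs)  # one long pseudo-document per community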
    Learning cortical representations through perturbed and adversarial dreaming. (arXiv:2109.04261v2 [q-bio.NC] UPDATED)
    (0 min) Humans and other animals learn to extract general concepts from sensory experience without extensive teaching. This ability is thought to be facilitated by offline states like sleep where previous experiences are systemically replayed. However, the characteristic creative nature of dreams suggests that learning semantic representations may go beyond merely replaying previous experiences. We support this hypothesis by implementing a cortical architecture inspired by generative adversarial networks (GANs). Learning in our model is organized across three different global brain states mimicking wakefulness, NREM and REM sleep, optimizing different, but complementary objective functions. We train the model on standard datasets of natural images and evaluate the quality of the learned representations. Our results suggest that generating new, virtual sensory inputs via adversarial dreaming during REM sleep is essential for extracting semantic concepts, while replaying episodic memories via perturbed dreaming during NREM sleep improves the robustness of latent representations. The model provides a new computational perspective on sleep states, memory replay and dreams and suggests a cortical implementation of GANs.
    Game of GANs: Game-Theoretical Models for Generative Adversarial Networks. (arXiv:2106.06976v3 [cs.LG] UPDATED)
    (0 min) Generative Adversarial Networks (GANs) have recently attracted considerable attention in the AI community due to their ability to generate high-quality data of significant statistical resemblance to real data. Fundamentally, a GAN is a game between two neural networks trained in an adversarial manner to reach a zero-sum Nash equilibrium profile. Despite the improvements accomplished in GANs in the last few years, several issues remain to be solved. This paper reviews the literature on the game-theoretic aspects of GANs and addresses how game theory models can address specific challenges of generative models and improve GAN performance. We first present some preliminaries, including the basic GAN model and some game theory background. We then present a taxonomy that classifies state-of-the-art solutions into three main categories: modified game models, modified architectures, and modified learning methods. The classification is based on the modifications made to the basic GAN model by the game-theoretic approaches proposed in the literature. We then explore the objectives of each category and discuss recent works in each. Finally, we discuss the remaining challenges in this field and present future research directions.
    Machine learning approaches for localized lockdown during COVID-19: a case study analysis. (arXiv:2201.00715v1 [cs.LG])
    (0 min) At the end of 2019, the novel coronavirus SARS-CoV-2 emerged as the cause of a significant acute respiratory disease that has become a global pandemic. Countries like Brazil have had difficulty dealing with the virus due to the large socioeconomic differences among their states and municipalities. Therefore, this study presents a new approach using different machine learning and deep learning algorithms applied to Brazilian COVID-19 data. First, a clustering algorithm is used to identify counties with similar sociodemographic behavior, while Benford's law is used to check for data manipulation. Based on these results, we are able to correctly fit SARIMA models on the clusters to predict new daily cases. The unsupervised machine learning techniques optimized the process of defining the parameters of the SARIMA model. This framework can also be useful for proposing confinement scenarios during the so-called second wave. We have used the 645 counties of S\~ao Paulo state, the most populous state in Brazil. However, this methodology can be used in other states or countries. This paper demonstrates how different techniques from machine learning, deep learning, data mining, and statistics can be used together to produce important results when dealing with pandemic data. Although the findings cannot be used exclusively to assess and influence policy decisions, they offer an alternative to the ineffective measures that have been used.
    Improving Actor-Critic Reinforcement Learning via Hamiltonian Monte Carlo Method. (arXiv:2103.12020v3 [cs.LG] UPDATED)
    (0 min) Actor-critic RL is widely used in various robotic control tasks. By viewing actor-critic RL from the perspective of variational inference (VI), the policy network is trained to obtain the approximate posterior of actions given the optimality criteria. However, in practice, actor-critic RL may yield suboptimal policy estimates due to the amortization gap and insufficient exploration. In this work, inspired by the previous use of Hamiltonian Monte Carlo (HMC) in VI, we propose to integrate the policy network of actor-critic RL with HMC, which we term the {\it Hamiltonian Policy}. We propose to evolve actions from the base policy according to HMC, which has several benefits. First, HMC can improve the policy distribution to better approximate the posterior and hence reduce the amortization gap. Second, HMC can also guide exploration toward regions of the action space with higher Q values, enhancing exploration efficiency. Further, instead of directly applying HMC to RL, we propose a new leapfrog operator to simulate the Hamiltonian dynamics. Finally, for safe RL problems, we find that the proposed method can not only improve the achieved return, but also reduce safety constraint violations by discarding potentially unsafe actions. With comprehensive empirical experiments on continuous control baselines, including MuJoCo and PyBullet Roboschool, we show that the proposed approach is a data-efficient and easy-to-implement improvement over previous actor-critic methods.
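    A toy sketch of the action-refinement idea (our own illustration: the quadratic Q stands in for a learned critic, and this is the textbook leapfrog integrator rather than the paper's modified operator):

        import numpy as np

        def grad_Q(a, target=np.array([1.0, -0.5])):
            # Gradient of a stand-in critic Q(a) = -0.5 * ||a - target||^2.
            return -(a - target)

        def leapfrog(a, steps=10, eps=0.1, seed=0):
            rng = np.random.default_rng(seed)
            p = rng.normal(size=a.shape)      # sample an initial momentum
            p = p + 0.5 * eps * grad_Q(a)     # half step on the momentum
            for _ in range(steps):
                a = a + eps * p               # full step on the action
                p = p + eps * grad_Q(a)       # full step on the momentum
            p = p - 0.5 * eps * grad_Q(a)     # undo the extra half momentum step
            return a

        base_action = np.zeros(2)             # e.g., the actor network's output
        print(leapfrog(base_action))          # refined action, drifts toward higher Q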
    An Efficient and Accurate Rough Set for Feature Selection, Classification and Knowledge Representation. (arXiv:2201.00436v1 [cs.LG])
    (0 min) This paper presents a strong data mining method based on rough sets, which can realize feature selection, classification, and knowledge representation at the same time. Rough sets have good interpretability and are a popular method for feature selection, but low efficiency and low accuracy are the main drawbacks that limit their applicability. Regarding accuracy, we first identify the ineffectiveness of rough sets caused by overfitting, especially when processing noisy attributes, and propose a robust measure of an attribute's relevance, called relative importance. We then propose the concept of a "rough concept tree" for knowledge representation and classification. Experimental results on public benchmark data sets show that the proposed framework achieves higher accuracy than seven popular or state-of-the-art feature selection methods.
    Optimal Sampling Gaps for Adaptive Submodular Maximization. (arXiv:2104.01750v4 [cs.LG] UPDATED)
    (0 min) Running machine learning algorithms on large and rapidly growing volumes of data is often computationally expensive. One common trick to reduce the size of a data set, and thus the computational cost of machine learning algorithms, is \emph{probability sampling}. It creates a sampled data set by including each data point from the original data set with a known probability. Although the benefit of running machine learning algorithms on the reduced data set is obvious, one major concern is that the performance of the solution obtained from samples might be much worse than that of the optimal solution when using the full data set. In this paper, we examine the performance loss caused by probability sampling in the context of adaptive submodular maximization. We consider a simple probability sampling method which selects each data point with probability $r\in[0,1]$. If we set the sampling rate $r=1$, our problem reduces to finding a solution based on the original full data set. We define the sampling gap as the largest ratio between the optimal solution obtained from the full data set and the optimal solution obtained from the samples, over independence systems; it captures the performance loss of the optimal solution caused by probability sampling. Our main contribution is to show that if the utility function is policywise submodular, then for a given sampling rate $r$, the sampling gap is both upper bounded and lower bounded by $1/r$. One immediate implication of our result is that if we can find an $\alpha$-approximation solution based on a sampled data set (which is sampled at sampling rate $r$), then this solution achieves an $\alpha r$ approximation ratio against the optimal solution when using the full data set.
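    The sampling scheme itself is a one-liner; a quick sketch to fix ideas (sizes and rate are assumed):

        import numpy as np

        rng = np.random.default_rng(0)
        data = np.arange(10_000)
        r = 0.3
        # Probability sampling: keep each point independently with probability r.
        sample = data[rng.random(data.size) < r]
        print(len(sample))  # roughly r * len(data)
        # Per the paper's bound, an alpha-approximate solution computed on
        # `sample` is an (alpha * r)-approximation on the full data set.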
    An Ensemble of Pre-trained Transformer Models For Imbalanced Multiclass Malware Classification. (arXiv:2112.13236v2 [cs.CR] UPDATED)
    (0 min) Classification of malware families is crucial for a comprehensive understanding of how they can infect devices, computers, or systems. Thus, malware identification enables security researchers and incident responders to take precautions against malware and accelerate mitigation. API call sequences made by malware are widely utilized features in machine and deep learning models for malware classification, as these sequences represent the behavior of malware. However, traditional machine and deep learning models remain incapable of capturing sequence relationships between API calls. On the other hand, transformer-based models process sequences as a whole and learn relationships between API calls thanks to multi-head attention mechanisms and positional embeddings. Our experiments demonstrate that a transformer model with a single transformer block layer surpasses the widely used base architecture, the LSTM. Moreover, the pre-trained transformer models BERT and CANINE outperform other models in classifying highly imbalanced malware families according to our evaluation metrics, the F1-score and AUC score. Furthermore, the proposed bagging-based random transformer forest (RTF), an ensemble of BERT or CANINE models, reaches state-of-the-art evaluation scores on three out of four datasets, including a state-of-the-art F1-score of 0.6149 on one of the commonly used benchmark datasets.
    Logic Shrinkage: Learned FPGA Netlist Sparsity for Efficient Neural Network Inference. (arXiv:2112.02346v2 [cs.LG] UPDATED)
    (0 min) FPGA-specific DNN architectures using the native LUTs as independently trainable inference operators have been shown to achieve favorable area-accuracy and energy-accuracy tradeoffs. The first work in this area, LUTNet, exhibited state-of-the-art performance for standard DNN benchmarks. In this paper, we propose the learned optimization of such LUT-based topologies, resulting in higher-efficiency designs than via the direct use of off-the-shelf, hand-designed networks. Existing implementations of this class of architecture require the manual specification of the number of inputs per LUT, K. Choosing appropriate K a priori is challenging, and doing so at even high granularity, e.g. per layer, is a time-consuming and error-prone process that leaves FPGAs' spatial flexibility underexploited. Furthermore, prior works see LUT inputs connected randomly, which does not guarantee a good choice of network topology. To address these issues, we propose logic shrinkage, a fine-grained netlist pruning methodology enabling K to be automatically learned for every LUT in a neural network targeted for FPGA inference. By removing LUT inputs determined to be of low importance, our method increases the efficiency of the resultant accelerators. Our GPU-friendly solution to LUT input removal is capable of processing large topologies during their training with negligible slowdown. With logic shrinkage, we better the area and energy efficiency of the best-performing LUTNet implementation of the CNV network classifying CIFAR-10 by 1.54x and 1.31x, respectively, while matching its accuracy. This implementation also reaches 2.71x the area efficiency of an equally accurate, heavily pruned BNN. On ImageNet with the Bi-Real Net architecture, employment of logic shrinkage results in a post-synthesis area reduction of 2.67x vs LUTNet, allowing for implementation that was previously impossible on today's largest FPGAs.
    Automatic Pharma News Categorization. (arXiv:2201.00688v1 [cs.IR])
    (0 min) We use a well-balanced text dataset consisting of 23 news categories relevant to pharma information science to compare the fine-tuning performance of multiple transformer models, both autoregressive and autoencoding, in a classification task. To validate the winning approach, we perform diagnostics of model behavior on mispredicted instances, including inspection of category-wise metrics, evaluation of prediction certainty, and assessment of latent space representations. Lastly, we propose an ensemble model consisting of the top-performing individual predictors and demonstrate that this approach offers a modest improvement in the F1 metric.
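    The ensembling step could look like the following sketch (simple soft voting over class probabilities; the abstract does not specify the combination rule, so treat this, and the numbers, as assumptions):

        import numpy as np

        # Stand-in class probabilities from three fine-tuned models
        # (in the paper these would come from the top transformer predictors).
        probs_a = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
        probs_b = np.array([[0.6, 0.3, 0.1], [0.2, 0.6, 0.2]])
        probs_c = np.array([[0.8, 0.1, 0.1], [0.1, 0.7, 0.2]])

        ensemble = (probs_a + probs_b + probs_c) / 3.0  # soft voting
        predictions = ensemble.argmax(axis=1)
        print(predictions)  # predicted class index per document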
    Conditional Selective Inference for Robust Regression and Outlier Detection using Piecewise-Linear Homotopy Continuation. (arXiv:2104.10840v2 [stat.ML] UPDATED)
    (0 min) In practical data analysis under noisy environments, it is common to first use robust methods to identify outliers and then conduct further analysis after removing the outliers. In this paper, we consider statistical inference on the model estimated after outliers are removed, which can be interpreted as a selective inference (SI) problem. To use the conditional SI framework, it is necessary to characterize the events of how the robust method identifies outliers. Unfortunately, existing methods cannot be directly used here because they are applicable only when the selection events can be represented by linear/quadratic constraints. In this paper, we propose a conditional SI method for popular robust regressions using a homotopy method. We show that the proposed conditional SI method is applicable to a wide class of robust regression and outlier detection methods and has good empirical performance in both synthetic and real data experiments.
    Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods. (arXiv:1907.09358v3 [cs.CV] UPDATED)
    (0 min) Interest in Artificial Intelligence (AI) and its applications has seen unprecedented growth in the last few years. This success can be partly attributed to the advancements made in the sub-fields of AI such as machine learning, computer vision, and natural language processing. Much of the growth in these fields has been made possible with deep learning, a sub-area of machine learning that uses artificial neural networks. This has created significant interest in the integration of vision and language. In this survey, we focus on ten prominent tasks that integrate language and vision by discussing their problem formulations, methods, existing datasets, and evaluation measures, and by comparing the results obtained with corresponding state-of-the-art methods. Our efforts go beyond earlier surveys which are either task-specific or concentrate only on one type of visual content, i.e., image or video. Furthermore, we also provide some potential future directions in this field of research with an anticipation that this survey stimulates innovative thoughts and ideas to address the existing challenges and build new applications.
    Scalable semi-supervised dimensionality reduction with GPU-accelerated EmbedSOM. (arXiv:2201.00701v1 [cs.LG])
    (0 min) Dimensionality reduction methods have found vast application as visualization tools in diverse areas of science. Although many different methods exist, their performance is often insufficient for providing quick insight into many contemporary datasets, and the unsupervised mode of use prevents the users from utilizing the methods for dataset exploration and fine-tuning the details for improved visualization quality. We present BlosSOM, a high-performance semi-supervised dimensionality reduction software for interactive user-steerable visualization of high-dimensional datasets with millions of individual data points. BlosSOM builds on a GPU-accelerated implementation of the EmbedSOM algorithm, complemented by several landmark-based algorithms for interfacing the unsupervised model learning algorithms with the user supervision. We show the application of BlosSOM on realistic datasets, where it helps to produce high-quality visualizations that incorporate user-specified layout and focus on certain features. We believe the semi-supervised dimensionality reduction will improve the data visualization possibilities for science areas such as single-cell cytometry, and provide a fast and efficient base methodology for new directions in dataset exploration and annotation.
    A Family of Deep Learning Architectures for Channel Estimation and Hybrid Beamforming in Multi-Carrier mm-Wave Massive MIMO. (arXiv:1912.10036v6 [eess.SP] UPDATED)
    (0 min) Hybrid analog and digital beamforming transceivers are instrumental in addressing the challenge of expensive hardware and high training overheads in next generation millimeter-wave (mm-Wave) massive MIMO (multiple-input multiple-output) systems. However, the lack of fully digital beamforming in hybrid architectures and the short coherence times at mm-Wave impose additional constraints on channel estimation. Prior works addressing these challenges have focused largely on narrowband channels, where optimization-based or greedy algorithms were employed to derive hybrid beamformers. In this paper, we introduce a deep learning (DL) approach for channel estimation and hybrid beamforming for frequency-selective, wideband mm-Wave systems. In particular, we consider a massive MIMO Orthogonal Frequency Division Multiplexing (MIMO-OFDM) system and propose three different DL frameworks comprising convolutional neural networks (CNNs), which accept the raw received signal data as input and yield the channel estimates and hybrid beamformers at the output. We also introduce both offline and online prediction schemes. Numerical experiments demonstrate that, compared to the current state-of-the-art optimization and DL methods, our approach provides higher spectral efficiency, lower computational cost, fewer pilot signals, and higher tolerance to deviations in the received pilot data, corrupted channel matrix, and propagation environment.
    Convergence of Random Reshuffling Under The Kurdyka-{\L}ojasiewicz Inequality. (arXiv:2110.04926v2 [math.OC] UPDATED)
    (0 min) We study the random reshuffling (RR) method for smooth nonconvex optimization problems with a finite-sum structure. Though this method is widely utilized in practice, e.g., for the training of neural networks, its convergence behavior is only understood in several limited settings. In this paper, under the well-known Kurdyka-Lojasiewicz (KL) inequality, we establish strong limit-point convergence results for RR with appropriate diminishing step sizes, namely, the whole sequence of iterates generated by RR is convergent and converges to a single stationary point in an almost sure sense. In addition, we derive the corresponding rate of convergence, depending on the KL exponent and the suitably selected diminishing step sizes. When the KL exponent lies in $[0,\frac12]$, the convergence is at a rate of $\mathcal{O}(t^{-1})$ with $t$ counting the iteration number. When the KL exponent belongs to $(\frac12,1)$, our derived convergence rate is of the form $\mathcal{O}(t^{-q})$ with $q\in (0,1)$ depending on the KL exponent. The standard KL inequality-based convergence analysis framework only applies to algorithms with a certain descent property. We conduct a novel convergence analysis for the non-descent RR method with diminishing step sizes based on the KL inequality, which generalizes the standard KL framework. We summarize our main steps and core ideas in an informal analysis framework, which is of independent interest. As a direct application of this framework, we also establish similar strong limit-point convergence results for the reshuffled proximal point method.
    Boosting Contrastive Self-Supervised Learning with False Negative Cancellation. (arXiv:2011.11765v2 [cs.CV] UPDATED)
    (0 min) Self-supervised representation learning has made significant leaps fueled by progress in contrastive learning, which seeks to learn transformations that embed positive input pairs nearby, while pushing negative pairs far apart. While positive pairs can be generated reliably (e.g., as different views of the same image), it is difficult to accurately establish negative pairs, defined as samples from different images regardless of their semantic content or visual features. A fundamental problem in contrastive learning is mitigating the effects of false negatives. Contrasting false negatives induces two critical issues in representation learning: discarding semantic information and slow convergence. In this paper, we propose novel approaches to identify false negatives, as well as two strategies to mitigate their effect, i.e. false negative elimination and attraction, while systematically performing rigorous evaluations to study this problem in detail. Our method exhibits consistent improvements over existing contrastive learning-based methods. Without labels, we identify false negatives with 40% accuracy among 1000 semantic classes on ImageNet, and achieve 5.8% absolute improvement in top-1 accuracy over the previous state-of-the-art when finetuning with 1% labels. Our code is available at https://github.com/google-research/fnc.
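    A minimal sketch of the elimination strategy under simplifying assumptions (we flag the k negatives most similar to the anchor itself, whereas the paper identifies false negatives more carefully, e.g. against support views):

        import torch
        import torch.nn.functional as F

        def loss_with_elimination(anchor, positive, negatives, k=2, tau=0.1):
            # InfoNCE-style loss that drops the k negatives most similar to the
            # anchor (suspected false negatives) from the denominator.
            pos_sim = F.cosine_similarity(anchor, positive, dim=-1) / tau
            neg_sim = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1) / tau
            drop = neg_sim.topk(k, dim=1).indices
            neg_sim = neg_sim.scatter(1, drop, float("-inf"))  # eliminate them
            logits = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1)
            targets = torch.zeros(anchor.size(0), dtype=torch.long)
            return F.cross_entropy(logits, targets)  # positive sits at index 0

        anchor = F.normalize(torch.randn(8, 128), dim=-1)
        positive = F.normalize(torch.randn(8, 128), dim=-1)
        negatives = F.normalize(torch.randn(8, 32, 128), dim=-1)
        print(loss_with_elimination(anchor, positive, negatives))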
    Deep Learning Interviews: Hundreds of fully solved job interview questions from a wide range of key topics in AI. (arXiv:2201.00650v1 [cs.LG])
    (0 min) The second edition of Deep Learning Interviews is home to hundreds of fully-solved problems from a wide range of key topics in AI. It is designed both to rehearse interview- or exam-specific topics and to provide machine learning M.Sc./Ph.D. students, and those awaiting an interview, a well-organized overview of the field. The problems it poses are tough enough to cut your teeth on and to dramatically improve your skills, but they're framed within thought-provoking questions and engaging stories. That is what makes the volume so specifically valuable to students and job seekers: it provides them with the ability to speak confidently and quickly on any relevant topic, to answer technical questions clearly and correctly, and to fully understand the purpose and meaning of interview questions and answers. Those are powerful, indispensable advantages to have when walking into the interview room. The book's contents form a large inventory of numerous topics relevant to DL job interviews and graduate-level exams. That places this work at the forefront of the growing trend in science to teach a core set of practical mathematical and computational skills. It is widely accepted that the training of every computer scientist must include the fundamental theorems of ML, and AI appears in the curriculum of nearly every university. This volume is designed as an excellent reference for graduates of such programs.
    Boosting the Certified Robustness of L-infinity Distance Nets. (arXiv:2110.06850v3 [cs.LG] UPDATED)
    (0 min) Recently, Zhang et al.(2021) developed a new neural network architecture based on $\ell_\infty$-distance functions, which naturally possesses certified $\ell_\infty$ robustness by its construction. Despite rigorous theoretical guarantees, the model so far can only achieve comparable performance to conventional networks. In this paper, we make the following two contributions: $\mathrm{(i)}$ We demonstrate that $\ell_\infty$-distance nets enjoy a fundamental advantage in certified robustness over conventional networks (under typical certification approaches); $\mathrm{(ii)}$ With an improved training process we are able to significantly boost the certified accuracy of $\ell_\infty$-distance nets. Our training approach largely alleviates the optimization problem that arose in the previous training scheme, in particular, the unexpected large Lipschitz constant due to the use of a crucial trick called $\ell_p$-relaxation. The core of our training approach is a novel objective function that combines scaled cross-entropy loss and clipped hinge loss with a decaying mixing coefficient. Experiments show that using the proposed training strategy, the certified accuracy of $\ell_\infty$-distance net can be dramatically improved from 33.30% to 40.06% on CIFAR-10 ($\epsilon=8/255$), meanwhile outperforming other approaches in this area by a large margin. Our results clearly demonstrate the effectiveness and potential of $\ell_\infty$-distance net for certified robustness. Codes are available at https://github.com/zbh2047/L_inf-dist-net-v2.
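    A sketch of an objective in the spirit described (the scale, clipping threshold, and linear decay schedule are our assumptions; see the paper and its code for the exact form):

        import torch
        import torch.nn.functional as F

        def mixed_loss(logits, targets, epoch, total_epochs, scale=2.0, clip=1.0):
            lam = 1.0 - epoch / total_epochs          # decaying mixing coefficient
            ce = F.cross_entropy(scale * logits, targets)
            # Clipped multi-class hinge: penalize wrong-class margins, capped at clip.
            margins = logits - logits.gather(1, targets[:, None])
            hinge = torch.clamp(margins + 1.0, min=0.0, max=clip)
            mask = torch.ones_like(hinge)
            mask.scatter_(1, targets[:, None], 0.0)   # ignore the true class itself
            hinge = (hinge * mask).sum(dim=1).mean()
            return lam * ce + (1.0 - lam) * hinge

        logits = torch.randn(16, 10)
        targets = torch.randint(0, 10, (16,))
        print(mixed_loss(logits, targets, epoch=5, total_epochs=100))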
    Resource-Efficient and Delay-Aware Federated Learning Design under Edge Heterogeneity. (arXiv:2112.13926v2 [cs.NI] UPDATED)
    (0 min) Federated learning (FL) has emerged as a popular methodology for distributing machine learning across wireless edge devices. In this work, we consider optimizing the tradeoff between model performance and resource utilization in FL, under device-server communication delays and device computation heterogeneity. Our proposed StoFedDelAv algorithm incorporates a local-global model combiner into the FL synchronization step. We theoretically characterize the convergence behavior of StoFedDelAv and obtain the optimal combiner weights, which consider the global model delay and expected local gradient error at each device. We then formulate a network-aware optimization problem which tunes the minibatch sizes of the devices to jointly minimize energy consumption and machine learning training loss, and solve the non-convex problem through a series of convex approximations. Our simulations reveal that StoFedDelAv outperforms the current art in FL in terms of model convergence speed and network resource utilization when the minibatch size and the combiner weights are adjusted. Additionally, our method can reduce the number of uplink communication rounds required during the model training period to reach the same accuracy.
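    The combiner step at synchronization can be pictured as a convex combination (the fixed gamma below is an assumption; the paper derives the optimal weights from the global model delay and local gradient error):

        import numpy as np

        def combine(local_weights, global_weights, gamma=0.3):
            # Convex combination of the (possibly stale) global model and the
            # local model; gamma = 0.3 is an assumed value for illustration.
            return gamma * global_weights + (1.0 - gamma) * local_weights

        local = np.array([0.2, -0.1, 0.4])
        stale_global = np.array([0.25, -0.05, 0.3])  # delayed global model copy
        print(combine(local, stale_global))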
    AugMax: Adversarial Composition of Random Augmentations for Robust Training. (arXiv:2110.13771v3 [cs.CV] UPDATED)
    (0 min) Data augmentation is a simple yet effective way to improve the robustness of deep neural networks (DNNs). Diversity and hardness are two complementary dimensions of data augmentation to achieve robustness. For example, AugMix explores random compositions of a diverse set of augmentations to enhance broader coverage, while adversarial training generates adversarially hard samples to spot the weakness. Motivated by this, we propose a data augmentation framework, termed AugMax, to unify the two aspects of diversity and hardness. AugMax first randomly samples multiple augmentation operators and then learns an adversarial mixture of the selected operators. Being a stronger form of data augmentation, AugMax leads to a significantly augmented input distribution which makes model training more challenging. To solve this problem, we further design a disentangled normalization module, termed DuBIN (Dual-Batch-and-Instance Normalization), that disentangles the instance-wise feature heterogeneity arising from AugMax. Experiments show that AugMax-DuBIN leads to significantly improved out-of-distribution robustness, outperforming prior art by 3.03%, 3.49%, 1.82% and 0.71% on CIFAR10-C, CIFAR100-C, Tiny ImageNet-C and ImageNet-C. Codes and pretrained models are available: https://github.com/VITA-Group/AugMax.
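    The adversarial-mixture step can be pictured with a short PyTorch-style sketch: sample k augmentation operators at random (diversity), then ascend the task loss with respect to the mixture weights (hardness). The interface of ops, the step sizes, and the sigmoid/softmax reparametrizations are assumptions for illustration, not the released implementation:

        import random
        import torch
        import torch.nn.functional as F

        def augmax_batch(x, y, model, ops, k=3, steps=5, lr=0.1):
            chosen = random.sample(ops, k)                 # diversity: random operators
            views = torch.stack([op(x) for op in chosen])  # (k, B, C, H, W)
            w = torch.zeros(k, requires_grad=True)         # mixture logits over operators
            m = torch.zeros(1, requires_grad=True)         # clean-vs-mixed weight
            for _ in range(steps):                         # hardness: maximize the loss
                mix = (torch.softmax(w, 0).view(k, 1, 1, 1, 1) * views).sum(0)
                x_adv = torch.sigmoid(m) * x + (1 - torch.sigmoid(m)) * mix
                loss = F.cross_entropy(model(x_adv), y)
                g_w, g_m = torch.autograd.grad(loss, [w, m])
                with torch.no_grad():
                    w += lr * g_w.sign()
                    m += lr * g_m.sign()
            with torch.no_grad():
                mix = (torch.softmax(w, 0).view(k, 1, 1, 1, 1) * views).sum(0)
                return torch.sigmoid(m) * x + (1 - torch.sigmoid(m)) * mix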
    Parkour Spot ID: Feature Matching in Satellite and Street view images using Deep Learning. (arXiv:2201.00377v1 [cs.CV])
    (0 min) How can we find places that are not indexed by Google Maps? We propose an intuitive method and framework to locate places based on their distinctive spatial features. The method uses satellite and street view images in machine vision approaches to classify locations. If we can classify one location, we only need to repeat the process for non-overlapping locations in our area of interest. We assess the proposed system by finding Parkour spots on the campus of Arizona State University. The results are very satisfactory: the system found more than 25 new Parkour spots, with a true-positive rate above 60%.
    Model-free Neural Counterfactual Regret Minimization with Bootstrap Learning. (arXiv:2012.01870v3 [cs.LG] UPDATED)
    (0 min) Counterfactual Regret Minimization (CFR) has achieved many fascinating results in solving large-scale Imperfect Information Games (IIGs). Neural-network-approximated CFR (neural CFR) is a promising technique that can reduce computation and memory consumption by generalizing decision information between similar states. Current neural CFR algorithms have to approximate cumulative regrets. However, efficient and accurate approximation in a large-scale IIG is still a tough challenge. In this paper, a new CFR variant, Recursive CFR (ReCFR), is proposed. In ReCFR, Recursive Substitute Values (RSVs) are learned and used to replace cumulative regrets. It is proven that ReCFR can converge to a Nash equilibrium at a rate of $O(\frac{1}{\sqrt{T}})$. Based on ReCFR, a new model-free neural CFR with bootstrap learning, Neural ReCFR-B, is proposed. Due to the recursive and non-cumulative nature of RSVs, Neural ReCFR-B has lower-variance training targets than other neural CFRs. Experimental results show that Neural ReCFR-B is competitive with the state-of-the-art neural CFR algorithms at a much lower training cost.
    MHATC: Autism Spectrum Disorder identification utilizing multi-head attention encoder along with temporal consolidation modules. (arXiv:2201.00404v1 [q-bio.NC])
    (0 min) Resting-state fMRI is commonly used for diagnosing Autism Spectrum Disorder (ASD) by using network-based functional connectivity. It has been shown that ASD is associated with brain regions and their inter-connections. However, discriminating between imaging data of the control population and that of ASD patients' brains based on connectivity patterns is a non-trivial task. To tackle this classification task, we propose a novel deep learning architecture (MHATC) consisting of multi-head attention and temporal consolidation modules for classifying an individual as a patient of ASD. The devised architecture results from an in-depth analysis of the limitations of current deep neural network solutions for similar applications. Our approach is not only robust but also computationally efficient, allowing its adoption in a variety of other research and clinical settings.
    MORAL: Aligning AI with Human Norms through Multi-Objective Reinforced Active Learning. (arXiv:2201.00012v1 [cs.LG])
    (0 min) Inferring reward functions from demonstrations and pairwise preferences are promising approaches for aligning Reinforcement Learning (RL) agents with human intentions. However, state-of-the-art methods typically focus on learning a single reward model, thus rendering it difficult to trade off different reward functions from multiple experts. We propose Multi-Objective Reinforced Active Learning (MORAL), a novel method for combining diverse demonstrations of social norms into a Pareto-optimal policy. By maintaining a distribution over scalarization weights, our approach is able to interactively tune a deep RL agent towards a variety of preferences, while eliminating the need for computing multiple policies. We empirically demonstrate the effectiveness of MORAL in two scenarios, which model a delivery and an emergency task that require an agent to act in the presence of normative conflicts. Overall, we consider our research a step towards multi-objective RL with learned rewards, bridging the gap between current reward learning and machine ethics literature.
    maskGRU: Tracking Small Objects in the Presence of Large Background Motions. (arXiv:2201.00467v1 [cs.CV])
    (0 min) We propose a recurrent neural network-based spatio-temporal framework named maskGRU for the detection and tracking of small objects in videos. While there have been many developments in the area of object tracking in recent years, tracking a small moving object amid other moving objects and actors (such as a ball amid moving players in sports footage) continues to be a difficult task. Existing spatio-temporal networks, such as convolutional Gated Recurrent Units (convGRUs), are difficult to train and have trouble accurately tracking small objects under such conditions. To overcome these difficulties, we developed the maskGRU framework that uses a weighted sum of the internal hidden state produced by a convGRU and a 3-channel mask of the tracked object's predicted bounding box as the hidden state to be used at the next time step of the underlying convGRU. We believe the technique of incorporating a mask into the hidden state through a weighted sum has two benefits: controlling the effect of exploding gradients and introducing an attention-like mechanism into the network by indicating where in the previous video frame the object is located. Our experiments show that maskGRU outperforms convGRU at tracking objects that are small relative to the video resolution even in the presence of other moving objects.
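    A minimal sketch of the hidden-state update described above, assuming a convGRU cell with a 3-channel hidden state so the box mask can be summed in; the cell interface, the mask rendering, and the weight alpha are assumptions for illustration:

        import torch

        def maskgru_step(convgru_cell, x_t, h_prev, box_mask, alpha=0.5):
            # Standard convGRU update on the current frame.
            h = convgru_cell(x_t, h_prev)
            # box_mask: (B, 3, H, W) rendering of the predicted bounding box
            # (ones inside the box, zeros outside), tiled to 3 channels.
            # The weighted sum damps the hidden state (against exploding
            # gradients) and points the next step at the object's location.
            h_next = alpha * h + (1.0 - alpha) * box_mask
            return h, h_next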
    Transformer Embeddings of Irregularly Spaced Events and Their Participants. (arXiv:2201.00044v1 [cs.LG])
    (0 min) We propose an approach to modeling irregularly spaced sequences of discrete events. We begin with a continuous-time variant of the Transformer, which was originally formulated (Vaswani et al., 2017) for sequences without timestamps. We embed a possible event (or other boolean fact) at time $t$ by using attention over the events that occurred at times $< t$ (and the facts that were true when they occurred). We control this attention using pattern-matching logic rules that relate events and facts that share participants. These rules determine which previous events will be attended to, as well as how to transform the embeddings of the events and facts into the attentional queries, keys, and values. Other logic rules describe how to change the set of facts in response to events. Our approach closely follows Mei et al. (2020a), and adopts their Datalog Through Time formalism for logic rules. As in that work, a domain expert first writes a set of logic rules that establishes the set of possible events and other facts at each time $t$. Each possible event or other fact is embedded using a neural architecture that is derived from the rules that established it. Our only difference from Mei et al. (2020a) is that we derive a flatter, attention-based neural architecture whereas they used a more serial LSTM architecture. We find that our attention-based approach performs about equally well on the RoboCup dataset, where the logic rules play an important role in improving performance. We also compared these two methods with two previous attention-based methods (Zuo et al., 2020; Zhang et al., 2020a) on simpler synthetic and real domains without logic rules, and found our proposed approach to be at least as good, and sometimes better, than each of the other three methods.
    The DONUT Approach to Ensemble Combination Forecasting. (arXiv:2201.00426v1 [cs.LG])
    (0 min) This paper presents an ensemble forecasting method that shows strong results on the M4 Competition dataset by decreasing feature and model selection assumptions, termed DONUT (DO Not UTilize human assumptions). Our assumption reductions, consisting mainly of auto-generated features and a more diverse model pool for the ensemble, significantly outperform the statistical-feature-based ensemble method FFORMA by Montero-Manso et al. (2020). Furthermore, we investigate feature extraction with a Long short-term memory network (LSTM) autoencoder and find that such features contain crucial information not captured by traditional statistical feature approaches. The ensemble weighting model uses both LSTM features and statistical features to combine the models accurately. Analysis of feature importance and interaction shows a slight superiority of LSTM features over the statistical ones alone. Clustering analysis shows that the essential LSTM features differ from most statistical features and from each other. We also find that increasing the solution space of the weighting model by augmenting the ensemble with new models is something the weighting model learns to use, explaining part of the accuracy gains. Lastly, we present a formal ex-post-facto analysis of optimal combination and selection for ensembles, quantifying differences through linear optimization on the M4 dataset. We also include a short proof that model combination is superior to model selection, a posteriori.
    Have I done enough planning or should I plan more?. (arXiv:2201.00764v1 [cs.AI])
    (0 min) People's decisions about how to allocate their limited computational resources are essential to human intelligence. An important component of this metacognitive ability is deciding whether to continue thinking about what to do and move on to the next decision. Here, we show that people acquire this ability through learning and reverse-engineer the underlying learning mechanisms. Using a process-tracing paradigm that externalises human planning, we find that people quickly adapt how much planning they perform to the cost and benefit of planning. To discover the underlying metacognitive learning mechanisms we augmented a set of reinforcement learning models with metacognitive features and performed Bayesian model selection. Our results suggest that the metacognitive ability to adjust the amount of planning might be learned through a policy-gradient mechanism that is guided by metacognitive pseudo-rewards that communicate the value of planning.
    Confidence-Aware Multi-Teacher Knowledge Distillation. (arXiv:2201.00007v1 [cs.LG])
    (0 min) Knowledge distillation was initially introduced to utilize additional supervision from a single teacher model for the student model training. To boost the student performance, some recent variants attempt to exploit diverse knowledge sources from multiple teachers. However, existing studies mainly integrate knowledge from diverse sources by averaging over multiple teacher predictions or combining them using various other label-free strategies, which may mislead the student in the presence of low-quality teacher predictions. To tackle this problem, we propose Confidence-Aware Multi-teacher Knowledge Distillation (CA-MKD), which adaptively assigns sample-wise reliability for each teacher prediction with the help of ground-truth labels, with those teacher predictions close to one-hot labels assigned large weights. Besides, CA-MKD incorporates intermediate layers to further improve student performance. Extensive experiments show that our CA-MKD consistently outperforms all compared state-of-the-art methods across various teacher-student architectures.
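    A minimal PyTorch-style sketch of the confidence-aware weighting idea: each teacher receives a per-sample weight that grows as its prediction approaches the one-hot label, and the weights modulate a standard distillation term. The softmax-over-negative-CE weighting and the temperature are assumptions, not the paper's exact formulation:

        import torch
        import torch.nn.functional as F

        def ca_mkd_loss(student_logits, teacher_logits_list, labels, T=4.0):
            # Per-sample confidence of each teacher: low CE against the
            # ground truth means the prediction is close to the one-hot label.
            losses = torch.stack([
                F.cross_entropy(t, labels, reduction="none")      # (B,)
                for t in teacher_logits_list
            ])                                                    # (K, B)
            weights = torch.softmax(-losses, dim=0)               # confident => heavy
            # Temperature-scaled KL distillation term per teacher.
            kd = torch.stack([
                F.kl_div(F.log_softmax(student_logits / T, dim=1),
                         F.softmax(t / T, dim=1),
                         reduction="none").sum(dim=1)             # (B,)
                for t in teacher_logits_list
            ])                                                    # (K, B)
            return (weights * kd).sum(dim=0).mean() * T * T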
    MLOps -- Definitions, Tools and Challenges. (arXiv:2201.00162v1 [cs.LG])
    (0 min) This paper is an overview of the Machine Learning Operations (MLOps) area. Our aim is to define the operation and the components of such systems by highlighting the current problems and trends. In this context, we present the different tools and their usefulness in order to provide the corresponding guidelines. Moreover, we identify the connection between MLOps and AutoML (Automated Machine Learning) and propose how this combination could work.
    PowerGraph: Using neural networks and principal components to multivariate statistical power trade-offs. (arXiv:2201.00719v1 [stat.ME])
    (0 min) It is increasingly acknowledged that a priori statistical power estimation for planned studies with multiple model parameters is inherently a multivariate problem. Power for individual parameters of interest cannot be reliably estimated univariately because sampling variability in, correlation with, and variance explained relative to one parameter will impact the power for another parameter, all usual univariate considerations being equal. Explicit solutions in such cases, especially for models with many parameters, are either impractical or impossible to solve, leaving researchers with the prevailing method of simulating power. However, point estimates for a vector of model parameters are uncertain, and the impact of inaccuracy is unknown. In such cases, sensitivity analysis is recommended such that multiple combinations of possible observable parameter vectors are simulated to understand power trade-offs. A limitation to this approach is that it is computationally expensive to generate sufficient sensitivity combinations to accurately map the power trade-off function in increasingly high dimensional spaces for the models that social scientists estimate. This paper explores the efficient estimation and graphing of statistical power for a study over varying model parameter combinations. Optimally powering a study is crucial to ensure a minimum probability of finding the hypothesized effect. We first demonstrate the impact of varying parameter values on power for specific hypotheses of interest and quantify the computational intensity of computing such a graph for a given level of precision. Finally, we propose a simple and generalizable machine learning inspired solution to cut the computational cost to less than 7% of what could be called a brute force approach. [abridged]
    Transfer RL across Observation Feature Spaces via Model-Based Regularization. (arXiv:2201.00248v1 [cs.LG])
    (0 min) In many reinforcement learning (RL) applications, the observation space is specified by human developers and restricted by physical realizations, and may thus be subject to dramatic changes over time (e.g. increased number of observable features). However, when the observation space changes, the previous policy will likely fail due to the mismatch of input features, and another policy must be trained from scratch, which is inefficient in terms of computation and sample complexity. Following theoretical insights, we propose a novel algorithm which extracts the latent-space dynamics in the source task, and transfers the dynamics model to the target task to use as a model-based regularizer. Our algorithm works for drastic changes of observation space (e.g. from vector-based observation to image-based observation), without any inter-task mapping or any prior knowledge of the target task. Empirical results show that our algorithm significantly improves the efficiency and stability of learning in the target task.
    Swift and Sure: Hardness-aware Contrastive Learning for Low-dimensional Knowledge Graph Embeddings. (arXiv:2201.00565v1 [cs.LG])
    (0 min) Knowledge graph embedding (KGE) has drawn great attention due to its potential in automatic knowledge graph (KG) completion and knowledge-driven tasks. However, recent KGE models suffer from high training cost and large storage space, thus limiting their practicality in real-world applications. To address this challenge, based on the latest findings in the field of Contrastive Learning, we propose a novel KGE training framework called Hardness-aware Low-dimensional Embedding (HaLE). Instead of the traditional Negative Sampling, we design a new loss function based on query sampling that can balance two important training targets, Alignment and Uniformity. Furthermore, we analyze the hardness-aware ability of recent low-dimensional hyperbolic models and propose a lightweight hardness-aware activation mechanism, which can help the KGE models focus on hard instances and speed up convergence. The experimental results show that in a limited training time, HaLE can effectively improve the performance and training speed of KGE models on five commonly-used datasets. The HaLE-trained models can obtain high prediction accuracy after training for only a few minutes and are competitive with the state-of-the-art models in both low- and high-dimensional conditions.
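    The two training targets that HaLE balances, Alignment and Uniformity, can be written compactly. The sketch below follows the common definitions of Wang and Isola (2020) rather than the paper's exact query-sampling loss; x and y are assumed to be L2-normalized embeddings of matched pairs:

        import torch

        def alignment(x, y, alpha=2):
            # x, y: (B, d) L2-normalized embeddings of positive pairs.
            return (x - y).norm(p=2, dim=1).pow(alpha).mean()

        def uniformity(x, t=2):
            # Log of the mean Gaussian potential over all pairs: penalizes
            # embeddings that clump together on the hypersphere.
            return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()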
    Lung-Originated Tumor Segmentation from Computed Tomography Scan (LOTUS) Benchmark. (arXiv:2201.00458v1 [eess.IV])
    (0 min) Lung cancer is one of the deadliest cancers, and in part its effective diagnosis and treatment depend on the accurate delineation of the tumor. Human-centered segmentation, which is currently the most common approach, is subject to inter-observer variability, and is also time-consuming, considering the fact that only experts are capable of providing annotations. Automatic and semi-automatic tumor segmentation methods have recently shown promising results. However, as different researchers have validated their algorithms using various datasets and performance metrics, reliably evaluating these methods is still an open challenge. The goal of the Lung-Originated Tumor Segmentation from Computed Tomography Scan (LOTUS) Benchmark, created through the 2018 IEEE Video and Image Processing (VIP) Cup competition, is to provide a unique dataset and pre-defined metrics, so that different researchers can develop and evaluate their methods in a unified fashion. The 2018 VIP Cup started with a global engagement from 42 countries to access the competition data. At the registration stage, there were 129 members clustered into 28 teams from 10 countries, out of which 9 teams made it to the final stage and 6 teams successfully completed all the required tasks. In a nutshell, all the algorithms proposed during the competition are based on deep learning models combined with a false positive reduction technique. Methods developed by the three finalists show promising results in tumor segmentation; however, more effort should be put into reducing the false positive rate. This competition manuscript presents an overview of the VIP-Cup challenge, along with the proposed algorithms and results.
    Neural network training under semidefinite constraints. (arXiv:2201.00632v1 [cs.LG])
    (0 min) This paper is concerned with the training of neural networks (NNs) under semidefinite constraints. This type of training problem has recently gained popularity since semidefinite constraints can be used to verify interesting properties of NNs, including, e.g., the estimation of an upper bound on the Lipschitz constant, which relates to the robustness of an NN, or the stability of dynamic systems with NN controllers. The utilized semidefinite constraints are based on sector constraints satisfied by the underlying activation functions. Unfortunately, one of the biggest bottlenecks of these new results is the required computational effort for incorporating the semidefinite constraints into the training of NNs, which limits their scalability to large NNs. We address this challenge by developing interior point methods for NN training that we implement using barrier functions for semidefinite constraints. In order to efficiently compute the gradients of the barrier terms, we exploit the structure of the semidefinite constraints. In experiments, we demonstrate the superior efficiency of our training method over previous approaches, which allows us, e.g., to use semidefinite constraints in the training of Wasserstein generative adversarial networks, where the discriminator must satisfy a Lipschitz condition.
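    The barrier idea can be illustrated in a few lines: replace the hard semidefinite constraint M(theta) positive definite with a log-det barrier added to the training loss. Here lmi_matrix stands in for the problem-specific linear matrix inequality built from the network weights, and the barrier weight mu is an assumption (typically decayed, as in interior point methods):

        import torch

        def barrier_loss(task_loss, lmi_matrix, mu=1e-3):
            # -logdet(M) diverges as M(theta) approaches the boundary of the
            # PSD cone, keeping the iterates strictly feasible. Requires a
            # symmetric positive definite lmi_matrix; mu is decayed over time.
            return task_loss - mu * torch.logdet(lmi_matrix)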
    An Efficient Combinatorial Optimization Model Using Learning-to-Rank Distillation. (arXiv:2201.00695v1 [cs.IR])
    (0 min) Recently, deep reinforcement learning (RL) has proven its feasibility in solving combinatorial optimization problems (COPs). Learning-to-rank techniques have been studied in the field of information retrieval. While several COPs can be formulated as the prioritization of input items, as is common in information retrieval, it has not been fully explored how learning-to-rank techniques can be incorporated into deep RL for COPs. In this paper, we present the learning-to-rank distillation-based COP framework, where a high-performance ranking policy obtained by RL for a COP can be distilled into a non-iterative, simple model, thereby achieving a low-latency COP solver. Specifically, we employ approximated ranking distillation to render a score-based ranking model learnable via gradient descent. Furthermore, we use efficient sequence sampling to improve the inference performance with a limited delay. With the framework, we demonstrate that a distilled model not only achieves performance comparable to the respective high-performance RL policy, but also provides several times faster inference. We evaluate the framework with several COPs such as priority-based task scheduling and multidimensional knapsack, demonstrating the benefits of the framework in terms of inference latency and performance.
    A Lightweight and Accurate Spatial-Temporal Transformer for Traffic Forecasting. (arXiv:2201.00008v1 [cs.LG])
    (0 min) We study the forecasting problem for traffic with dynamic, possibly periodical, and joint spatial-temporal dependency between regions. Given the aggregated inflow and outflow traffic of regions in a city from time slots 0 to t-1, we predict the traffic at time t at any region. Prior work in the area often considers the spatial and temporal dependencies in a decoupled manner, or is rather computationally intensive in training with a large number of hyper-parameters to tune. We propose ST-TIS, a novel, lightweight, and accurate Spatial-Temporal Transformer with information fusion and region sampling for traffic forecasting. ST-TIS extends the canonical Transformer with information fusion and region sampling. The information fusion module captures the complex spatial-temporal dependency between regions. The region sampling module improves efficiency and prediction accuracy, cutting the computation complexity for dependency learning from $O(n^2)$ to $O(n\sqrt{n})$, where n is the number of regions. With far fewer parameters than state-of-the-art models, the offline training of our model is significantly faster in terms of tuning and computation (with a reduction of up to $90\%$ on training time and network parameters). Notwithstanding such training efficiency, extensive experiments show that ST-TIS is substantially more accurate in online prediction than state-of-the-art approaches (with an average improvement of up to $11\%$ on RMSE, $14\%$ on MAPE).
    Improving Out-of-Distribution Robustness via Selective Augmentation. (arXiv:2201.00299v1 [cs.LG])
    (0 min) Machine learning algorithms typically assume that training and test examples are drawn from the same distribution. However, distribution shift is a common problem in real-world applications and can cause models to perform dramatically worse at test time. In this paper, we specifically consider the problems of domain shifts and subpopulation shifts (e.g., imbalanced data). While prior works often seek to explicitly regularize internal representations and predictors of the model to be domain invariant, we instead aim to regularize the whole function without restricting the model's internal representations. This leads to a simple mixup-based technique which learns invariant functions via selective augmentation called LISA. LISA selectively interpolates samples either with the same labels but different domains or with the same domain but different labels. We analyze a linear setting and theoretically show how LISA leads to a smaller worst-group error. Empirically, we study the effectiveness of LISA on nine benchmarks ranging from subpopulation shifts to domain shifts, and we find that LISA consistently outperforms other state-of-the-art methods.
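    A minimal sketch of the selective interpolation rule: with equal probability, mix two samples that share a label but come from different domains, or share a domain but carry different labels. The Beta(2, 2) coefficient and the sampling details are assumptions for illustration:

        import numpy as np

        def lisa_pair(xs, ys, ds, rng=np.random):
            # xs: (N, ...) inputs; ys: (N,) int labels; ds: (N,) int domain ids.
            i = rng.randint(len(xs))
            if rng.rand() < 0.5:   # intra-label, inter-domain interpolation
                cand = [j for j in range(len(xs)) if ys[j] == ys[i] and ds[j] != ds[i]]
            else:                  # intra-domain, inter-label interpolation
                cand = [j for j in range(len(xs)) if ds[j] == ds[i] and ys[j] != ys[i]]
            j = rng.choice(cand) if cand else i
            lam = rng.beta(2.0, 2.0)
            onehot = np.eye(ys.max() + 1)
            x_mix = lam * xs[i] + (1 - lam) * xs[j]
            y_mix = lam * onehot[ys[i]] + (1 - lam) * onehot[ys[j]]
            return x_mix, y_mix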
    Deep-learning-based upscaling method for geologic models via theory-guided convolutional neural network. (arXiv:2201.00698v1 [cs.LG])
    (0 min) Large-scale or high-resolution geologic models usually comprise a huge number of grid blocks, which can be computationally demanding and time-consuming to solve with numerical simulators. Therefore, it is advantageous to upscale geologic models (e.g., hydraulic conductivity) from fine-scale (high-resolution grids) to coarse-scale systems. Numerical upscaling methods have been proven to be effective and robust for coarsening geologic models, but their efficiency remains to be improved. In this work, a deep-learning-based method is proposed to upscale fine-scale geologic models, which can significantly improve upscaling efficiency. In the deep learning method, a deep convolutional neural network (CNN) is trained to approximate the relationship between the coarse grid of hydraulic conductivity fields and the hydraulic heads, which can then be utilized to replace the numerical solvers while solving the flow equations for each coarse block. In addition, physical laws (e.g., governing equations and periodic boundary conditions) can also be incorporated into the training process of the deep CNN model, which is termed the theory-guided convolutional neural network (TgCNN). With the physical information considered, the dependence on the volume of training data can be greatly reduced. Several subsurface flow cases are introduced to test the performance of the proposed deep-learning-based upscaling method, including 2D and 3D cases, and isotropic and anisotropic cases. The results show that the deep learning method can provide equivalent upscaling accuracy to the numerical method, and efficiency can be improved significantly compared to numerical upscaling.
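    The theory-guided training objective can be pictured as a data-mismatch term plus a penalty on the residual of the governing flow equation. In the sketch below, pde_residual_fn stands in for a problem-specific discretization (e.g., finite differences of div(K grad h) = 0), and the weighting is an assumption:

        import torch

        def tgcnn_loss(pred_heads, true_heads, pde_residual_fn, conductivity):
            # Data term: fit the observed hydraulic heads.
            data_loss = torch.mean((pred_heads - true_heads) ** 2)
            # Physics term: penalize violation of the governing equation,
            # evaluated by a user-supplied residual function (a stand-in).
            physics_loss = torch.mean(pde_residual_fn(pred_heads, conductivity) ** 2)
            return data_loss + 0.1 * physics_loss  # weighting is an assumption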
    Representation Topology Divergence: A Method for Comparing Neural Network Representations. (arXiv:2201.00058v1 [cs.LG])
    (0 min) Comparison of data representations is a complex multi-aspect problem that has not enjoyed a complete solution yet. We propose a method for comparing two data representations. We introduce the Representation Topology Divergence (RTD), measuring the dissimilarity in multi-scale topology between two point clouds of equal size with a one-to-one correspondence between points. The data point clouds are allowed to lie in different ambient spaces. The RTD is one of the few TDA-based practical methods applicable to real machine learning datasets. Experiments show that the proposed RTD agrees with the intuitive assessment of data representation similarity and is sensitive to its topological structure. We apply RTD to gain insights into neural network representations in computer vision and NLP domains for various problems: training dynamics analysis, data distribution shift, transfer learning, ensemble learning, and disentanglement assessment.
    Faster Unbalanced Optimal Transport: Translation invariant Sinkhorn and 1-D Frank-Wolfe. (arXiv:2201.00730v1 [math.OC])
    (0 min) Unbalanced optimal transport (UOT) extends optimal transport (OT) to take into account mass variations when comparing distributions. This is crucial to make OT successful in ML applications, making it robust to data normalization and outliers. The baseline algorithm is Sinkhorn, but its convergence speed might be significantly slower for UOT than for OT. In this work, we identify the cause for this deficiency, namely the lack of a global normalization of the iterates, which equivalently corresponds to a translation of the dual OT potentials. Our first contribution leverages this idea to develop a provably accelerated Sinkhorn algorithm (coined 'translation invariant Sinkhorn') for UOT, bridging the computational gap with OT. Our second contribution focuses on 1-D UOT and proposes a Frank-Wolfe solver applied to this translation invariant formulation. The linear oracle at each step amounts to solving a 1-D OT problem, resulting in linear time complexity per iteration. Our last contribution extends this method to the computation of UOT barycenters of 1-D measures. Numerical simulations showcase the convergence speed improvement brought by these three approaches.
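    For orientation, here is a minimal log-domain Sinkhorn loop for entropic UOT with KL marginal penalties, the baseline the paper accelerates. The damping factor rho/(rho + eps) applied to each update is precisely what slows convergence relative to balanced OT; the paper's translation-invariant recentering of the potentials, which removes this deficiency, is omitted here. Hyperparameters are assumptions:

        import numpy as np

        def wlogsumexp(Z, w, axis):
            # Stable log of a weighted sum of exponentials along `axis`.
            m = Z.max(axis=axis, keepdims=True)
            w = w[:, None] if axis == 0 else w[None, :]
            return m.squeeze(axis) + np.log((w * np.exp(Z - m)).sum(axis=axis))

        def uot_sinkhorn(C, a, b, eps=0.05, rho=1.0, iters=500):
            f, g = np.zeros(len(a)), np.zeros(len(b))
            tau = rho / (rho + eps)   # damping: the source of slow UOT convergence
            for _ in range(iters):
                f = -tau * eps * wlogsumexp((g[None, :] - C) / eps, b, axis=1)
                g = -tau * eps * wlogsumexp((f[:, None] - C) / eps, a, axis=0)
            # Transport plan with respect to the reference measure a (x) b.
            P = a[:, None] * b[None, :] * np.exp((f[:, None] + g[None, :] - C) / eps)
            return P, f, g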
    Hands-on Bayesian Neural Networks -- a Tutorial for Deep Learning Users. (arXiv:2007.06823v3 [cs.LG] UPDATED)
    (0 min) Modern deep learning methods constitute incredibly powerful tools to tackle a myriad of challenging problems. However, since deep learning methods operate as black boxes, the uncertainty associated with their predictions is often challenging to quantify. Bayesian statistics offer a formalism to understand and quantify the uncertainty associated with deep neural network predictions. This tutorial provides an overview of the relevant literature and a complete toolset to design, implement, train, use and evaluate Bayesian Neural Networks, i.e. Stochastic Artificial Neural Networks trained using Bayesian methods.
    Linear Classifiers in Product Space Forms. (arXiv:2102.10204v3 [cs.LG] UPDATED)
    (0 min) Embedding methods for product spaces are powerful techniques for low-distortion and low-dimensional representation of complex data structures. Here, we address the new problem of linear classification in product space forms -- products of Euclidean, spherical, and hyperbolic spaces. First, we describe novel formulations for linear classifiers on a Riemannian manifold using geodesics and Riemannian metrics which generalize straight lines and inner products in vector spaces. Second, we prove that linear classifiers in $d$-dimensional space forms of any curvature have the same expressive power, i.e., they can shatter exactly $d+1$ points. Third, we formalize linear classifiers in product space forms, describe the first known perceptron and support vector machine classifiers for such spaces and establish rigorous convergence results for perceptrons. Moreover, we prove that the Vapnik-Chervonenkis dimension of linear classifiers in a product space form of dimension $d$ is at least $d+1$. We support our theoretical findings with simulations on several datasets, including synthetic data, image data, and single-cell RNA sequencing (scRNA-seq) data. The results show that classification in low-dimensional product space forms for scRNA-seq data offers, on average, a performance improvement of $\sim15\%$ when compared to that in Euclidean spaces of the same dimension.
    Self-attention Multi-view Representation Learning with Diversity-promoting Complementarity. (arXiv:2201.00168v1 [cs.LG])
    (0 min) Multi-view learning attempts to generate a model with better performance by exploiting the consensus and/or complementarity among multi-view data. However, in terms of complementarity, most existing approaches can only find representations with single complementarity rather than complementary information with diversity. In this paper, to utilize both complementarity and consistency simultaneously and to give free rein to the potential of deep learning in grasping diversity-promoting complementarity for multi-view representation learning, we propose a novel supervised multi-view representation learning algorithm, called Self-Attention Multi-View network with Diversity-Promoting Complementarity (SAMVDPC), which exploits consistency through a group of encoders and uses self-attention to find complementary information entailing diversity. Extensive experiments conducted on eight real-world datasets demonstrate the effectiveness of our proposed method and show its superiority over several baseline methods, which only consider single complementary information.
    Layer-Wise Multi-View Decoding for Improved Natural Language Generation. (arXiv:2005.08081v6 [cs.CL] UPDATED)
    (0 min) In sequence-to-sequence learning, e.g., natural language generation, the decoder relies on the attention mechanism to efficiently extract information from the encoder. While it is common practice to draw information from only the last encoder layer, recent work has proposed to use representations from different encoder layers for diversified levels of information. Nonetheless, the decoder still obtains only a single view of the source sequences, which might lead to insufficient training of the encoder layer stack due to the hierarchy bypassing problem. In this work, we propose layer-wise multi-view decoding, where for each decoder layer, together with the representations from the last encoder layer, which serve as a global view, those from other encoder layers are supplemented for a stereoscopic view of the source sequences. Systematic experiments and analyses show that we successfully address the hierarchy bypassing problem, require almost negligible parameter increase, and substantially improve the performance of sequence-to-sequence learning with deep representations on five diverse tasks, i.e., machine translation, abstractive summarization, image captioning, video captioning, and medical report generation. In particular, our approach achieves new state-of-the-art results on eight benchmark datasets, including a low-resource machine translation dataset and two low-resource medical report generation datasets.
    Descriptors for Machine Learning Model of Generalized Force Field in Condensed Matter Systems. (arXiv:2201.00798v1 [cond-mat.str-el])
    (0 min) We outline the general framework of machine learning (ML) methods for multi-scale dynamical modeling of condensed matter systems, and in particular of strongly correlated electron models. Complex spatio-temporal behaviors in these systems often arise from the interplay between quasi-particles and the emergent dynamical classical degrees of freedom, such as local lattice distortions, spins, and order parameters. Central to the proposed framework is the ML energy model that, by successfully emulating the time-consuming electronic structure calculation, can accurately predict a local energy based on the classical field in the intermediate neighborhood. In order to properly include the symmetry of the electron Hamiltonian, a crucial component of the ML energy model is the descriptor that transforms the neighborhood configuration into invariant feature variables, which are input to the learning model. A general theory of the descriptor for the classical fields is formulated, and two types of models are distinguished depending on the presence or absence of an internal symmetry for the classical field. Several specific approaches to the descriptor of the classical fields are presented. Our focus is on the group-theoretical method that offers a systematic and rigorous approach to compute invariants based on the bispectrum coefficients. We propose an efficient implementation of the bispectrum method based on the concept of reference irreducible representations. Finally, the implementations of the various descriptors are demonstrated on well-known electronic lattice models.
    FIFA ranking: Evaluation and path forward. (arXiv:2201.00691v1 [cs.IR])
    (0 min) In this work we study the ranking algorithm used by Fédération Internationale de Football Association (FIFA); we analyze the parameters it currently uses, show the formal probabilistic model from which it can be derived, and optimize the latter. In particular, analyzing the games since the introduction of the algorithm in 2018, we conclude that the game's "importance" (as defined by FIFA) used in the algorithm is counterproductive from the point of view of the predictive capability of the algorithm. We also postulate the algorithm to be rooted in the formal modelling principle, where the Davidson model proposed in 1970 seems to be an excellent candidate, preserving the form of the algorithm currently used. The results indicate that the predictive capability of the algorithm is notably improved by using the home-field advantage and the explicit model for the draws in the game. Moderate, but notable improvement may be attained by introducing the weighting of the results with the goal differential, which although not rooted in a formal modelling principle, is compatible with the current algorithm and can be tuned to the characteristics of the football competition.
    Toward the Analysis of Graph Neural Networks. (arXiv:2201.00115v1 [cs.SE])
    (0 min) Graph Neural Networks (GNNs) have recently emerged as a robust framework for graph-structured data. They have been applied to many problems such as knowledge graph analysis, social network recommendation, and even Covid19 detection and vaccine development. However, unlike other deep neural networks such as Feed Forward Neural Networks (FFNNs), few analyses such as verification and property inference exist, potentially due to the dynamic behavior of GNNs, which can take arbitrary graphs as input, whereas FFNNs only take fixed-size numerical vectors as inputs. This paper proposes an approach to analyze GNNs by converting them into FFNNs and reusing existing FFNN analyses. We discuss various designs to ensure the scalability and accuracy of the conversions. We illustrate our method on a case study of node classification. We believe that our approach opens new research directions for understanding and analyzing GNNs.
    HarmoFL: Harmonizing Local and Global Drifts in Federated Learning on Heterogeneous Medical Images. (arXiv:2112.10775v2 [eess.IV] UPDATED)
    (0 min) Multiple medical institutions collaboratively training a model using federated learning (FL) has become a promising solution for maximizing the potential of data-driven models, yet the non-independent and identically distributed (non-iid) data in medical images is still an outstanding challenge in real-world practice. The feature heterogeneity caused by diverse scanners or protocols introduces a drift in the learning process, in both local (client) and global (server) optimizations, which harms the convergence as well as model performance. Many previous works have attempted to address the non-iid issue by tackling the drift locally or globally, but how to jointly solve the two essentially coupled drifts is still unclear. In this work, we concentrate on handling both local and global drifts and introduce a new harmonizing framework called HarmoFL. First, we propose to mitigate the local update drift by normalizing amplitudes of images transformed into the frequency domain to mimic a unified imaging setting, in order to generate a harmonized feature space across local clients. Second, based on harmonized features, we design a client weight perturbation guiding each local model to reach a flat optimum, where a neighborhood area of the local optimal solution has a uniformly low loss. Without any extra communication cost, the perturbation assists the global model to optimize towards a converged optimal solution by aggregating several local flat optima. We have theoretically analyzed the proposed method and empirically conducted extensive experiments on three medical image classification and segmentation tasks, showing that HarmoFL outperforms a set of recent state-of-the-art methods with promising convergence behavior. Code is available at https://github.com/med-air/HarmoFL.
    FamilySeer: Towards Optimized Tensor Codes by Exploiting Computation Subgraph Similarity. (arXiv:2201.00194v1 [cs.LG])
    (0 min) Deploying various deep learning (DL) models efficiently has boosted the research on DL compilers. The difficulty of generating optimized tensor codes drives DL compilers to adopt auto-tuning approaches, and the increasing demands require increasing auto-tuning efficiency and quality. Currently, DL compilers partition the input DL models into several subgraphs and leverage auto-tuning to find the optimal tensor codes of these subgraphs. However, existing auto-tuning approaches usually regard subgraphs as individual ones and overlook the similarities across them, and thus fail to exploit better tensor codes under limited time budgets. We propose FamilySeer, an auto-tuning framework for DL compilers that can generate better tensor codes even with limited time budgets. FamilySeer exploits the similarities and differences among subgraphs to organize them into subgraph families, where the tuning of one subgraph can also improve other subgraphs within the same family. The cost model of each family receives more purified training samples generated by the family and becomes more accurate, so that the costly measurements on real hardware can be replaced with lightweight estimates from the cost model. Our experiments show that FamilySeer can generate model codes with the same code performance more efficiently than state-of-the-art auto-tuning frameworks.
    Latent Gaussian Model Boosting. (arXiv:2105.08966v3 [cs.LG] UPDATED)
    (0 min) Latent Gaussian models and boosting are widely used techniques in statistics and machine learning. Tree-boosting shows excellent prediction accuracy on many data sets, but potential drawbacks are that it assumes conditional independence of samples, produces discontinuous predictions for, e.g., spatial data, and it can have difficulty with high-cardinality categorical variables. Latent Gaussian models, such as Gaussian process and grouped random effects models, are flexible prior models which explicitly model dependence among samples and which allow for efficient learning of predictor functions and for making probabilistic predictions. However, existing latent Gaussian models usually assume either a zero or a linear prior mean function which can be an unrealistic assumption. This article introduces a novel approach that combines boosting and latent Gaussian models to remedy the above-mentioned drawbacks and to leverage the advantages of both techniques. We obtain increased prediction accuracy compared to existing approaches in both simulated and real-world data experiments.
    Uncertainty Detection in EEG Neural Decoding Models. (arXiv:2201.00627v1 [eess.SP])
    (0 min) EEG decoding systems based on deep neural networks have been widely used in decision making of brain computer interfaces (BCI). Their predictions, however, can be unreliable given the significant variance and noise in EEG signals. Previous works on EEG analysis mainly focus on the exploration of noise patterns in the source signal, while the uncertainty during the decoding process is largely unexplored. Automatically detecting and quantifying such decoding uncertainty is important for BCI motor imagery applications such as robotic arm control. In this work, we propose an uncertainty estimation model (UE-EEG) to explore the uncertainty during the EEG decoding process, which considers both the uncertainty in the input signal and the uncertainty in the model. The model utilizes a dropout-oriented method for model uncertainty estimation, and a Bayesian neural network is adopted for modeling the uncertainty of the input data. The model can be integrated into current widely used deep learning classifiers without changes to the architecture. We performed extensive experiments for uncertainty estimation in both intra-subject EEG decoding and cross-subject EEG decoding on two public motor imagery datasets, where the proposed model achieves significant improvement in the quality of estimated uncertainty and demonstrates that UE-EEG is a useful tool for BCI applications.
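    The dropout-oriented part of the model-uncertainty estimate can be sketched in a few lines: keep dropout stochastic at test time, run several forward passes, and read off the predictive mean and variance. The sample count is an assumption; this is the generic Monte Carlo dropout recipe, not the paper's exact model:

        import torch

        @torch.no_grad()
        def mc_dropout_predict(model, x, n_samples=30):
            model.train()   # keeps dropout stochastic (in practice, switch only
                            # the dropout modules, since this also affects batchnorm)
            probs = torch.stack([torch.softmax(model(x), dim=1)
                                 for _ in range(n_samples)])
            model.eval()
            return probs.mean(0), probs.var(0)  # predictive mean and variance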
    Deep Feature Fusion for Mitosis Counting. (arXiv:2002.03781v2 [cs.CV] UPDATED)
    (0 min) Each woman living in the United States has about a 1 in 8 chance of developing invasive breast cancer. The mitotic cell count is one of the most common tests to assess the aggressiveness or grade of breast cancer. In this prognosis, histopathology images must be examined by a pathologist using high-resolution microscopes to count the cells. Unfortunately, this can be an exhausting task with poor reproducibility, especially for non-experts. Deep learning networks have recently been adapted to medical applications and are able to automatically localize these regions of interest. However, these region-based networks lack the ability to take advantage of the segmentation features produced by a full-image CNN, which are often used as a sole method of detection. Therefore, the proposed method leverages Faster RCNN for object detection while fusing segmentation features generated by a UNet with RGB image features to achieve an F-score of 0.508 on the MITOS-ATYPIA 2014 mitosis counting challenge dataset, outperforming state-of-the-art methods.
    Reinforcement Learning for Task Specifications with Action-Constraints. (arXiv:2201.00286v1 [cs.LG])
    (0 min) In this paper, we use concepts from supervisory control theory of discrete event systems to propose a method to learn optimal control policies for a finite-state Markov Decision Process (MDP) in which (only) certain sequences of actions are deemed unsafe (respectively safe). We assume that the set of action sequences that are deemed unsafe and/or safe are given in terms of a finite-state automaton; and propose a supervisor that disables a subset of actions at every state of the MDP so that the constraints on action sequence are satisfied. Then we present a version of the Q-learning algorithm for learning optimal policies in the presence of non-Markovian action-sequence and state constraints, where we use the development of reward machines to handle the state constraints. We illustrate the method using an example that captures the utility of automata-based methods for non-Markovian state and action specifications for reinforcement learning and show the results of simulations in this setting.
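    A minimal sketch of the construction: the Q-table is indexed by the product of the MDP state and the automaton state (keeping the problem Markovian), and the supervisor masks the actions the automaton disables. The env/automaton interfaces (reset, step, enabled) are hypothetical stand-ins for illustration:

        import numpy as np

        def supervised_q_learning(env, automaton, n_states, n_actions,
                                  episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
            # Q is indexed by (MDP state, automaton state, action).
            Q = np.zeros((n_states, automaton.n_states, n_actions))
            for _ in range(episodes):
                s, q, done = env.reset(), automaton.reset(), False
                while not done:
                    allowed = automaton.enabled(q)   # supervisor: enabled actions
                    if np.random.rand() < eps:
                        a = np.random.choice(allowed)
                    else:
                        a = allowed[int(np.argmax(Q[s, q, allowed]))]
                    s2, r, done = env.step(a)        # hypothetical interface
                    q2 = automaton.step(q, a)        # track the action history
                    target = r if done else r + gamma * Q[s2, q2, automaton.enabled(q2)].max()
                    Q[s, q, a] += alpha * (target - Q[s, q, a])
                    s, q = s2, q2
            return Q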
    Reconstructing spectral functions via automatic differentiation. (arXiv:2111.14760v2 [hep-ph] UPDATED)
    (0 min) Reconstructing spectral functions from Euclidean Green's functions is an important inverse problem in many-body physics. However, the inversion is proved to be ill-posed in realistic systems with noisy Green's functions. In this Letter, we propose an automatic differentiation (AD) framework as a generic tool for spectral reconstruction from propagator observables. Exploiting the neural networks' regularization as a non-local smoothness regulator of the spectral function, we represent spectral functions by neural networks and use the propagator's reconstruction error to optimize the network parameters in an unsupervised manner. In the training process, apart from the positive-definite form of the spectral function, no other explicit physical priors are embedded into the neural networks. The reconstruction performance is assessed through relative entropy and mean square error for two different network representations. Compared to the maximum entropy method, the AD framework achieves better performance in the large-noise situation. It is noted that the freedom of introducing non-local regularization is an inherent advantage of the present framework and may lead to substantial improvements in solving inverse problems.
    Self Punishment and Reward Backfill for Deep Q-Learning. (arXiv:2004.05002v2 [cs.AI] UPDATED)
    (0 min) Reinforcement learning agents learn by encouraging behaviours which maximize their total reward, usually provided by the environment. In many environments, however, the reward is provided after a series of actions rather than after each single action, leading the agent to experience ambiguity in terms of whether those actions are effective, an issue known as the credit assignment problem. In this paper, we propose two strategies inspired by behavioural psychology to enable the agent to intrinsically estimate more informative reward values for actions with no reward. The first strategy, called self-punishment (SP), discourages the agent from making mistakes that lead to undesirable terminal states. The second strategy, called the rewards backfill (RB), backpropagates the rewards between two rewarded actions. We prove that, under certain assumptions and regardless of the reinforcement learning algorithm used, these two strategies maintain the order of policies in the space of all possible policies in terms of their total reward, and, by extension, maintain the optimal policy. Hence, our proposed strategies integrate with any reinforcement learning algorithm that learns a value or action-value function through experience. We incorporated these two strategies into three popular deep reinforcement learning approaches and evaluated the results on thirty Atari games. After parameter tuning, our results indicate that the proposed strategies improve the tested methods in over 65 percent of the tested games, in some cases by more than a factor of 25.
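    The rewards-backfill (RB) strategy is easy to picture: propagate each non-zero reward backwards over the preceding zero-reward steps, so that the actions between two rewarded actions receive informative intermediate values. The geometric decay below is an assumption about the backfill shape, not the paper's exact scheme:

        import numpy as np

        def reward_backfill(rewards, decay=0.9):
            # Walk the episode backwards; zero-reward steps inherit a decayed
            # copy of the next non-zero reward that follows them.
            shaped = np.array(rewards, dtype=float)
            carry = 0.0
            for t in range(len(shaped) - 1, -1, -1):
                if shaped[t] != 0.0:
                    carry = shaped[t]
                else:
                    carry *= decay
                    shaped[t] = carry
            return shaped

        # e.g. reward_backfill([0, 0, 0, 1, 0, 0, 2])
        #   -> [0.729, 0.81, 0.9, 1.0, 1.62, 1.8, 2.0]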
    Mind Your Solver! On Adversarial Attack and Defense for Combinatorial Optimization. (arXiv:2201.00402v1 [math.OC])
    (0 min) Combinatorial optimization (CO) is a long-standing challenging task, not only because of its inherent complexity (e.g. NP-hardness) but also because of its possible sensitivity to input conditions. In this paper, we take the initiative in developing mechanisms for adversarial attack and defense on combinatorial optimization solvers, whereby the solver is treated as a black-box function and the original problem's underlying graph structure (which is often available and associated with the problem instance, e.g. DAG, TSP) is attacked under a given budget. In particular, we present a simple yet effective defense strategy to modify the graph structure to increase the robustness of solvers, which shows its universal effectiveness across tasks and solvers.
    A Review of Open-World Learning and Steps Toward Open-World Learning Without Labels. (arXiv:2011.12906v3 [cs.CV] UPDATED)
    (0 min) In open-world learning, an agent starts with a set of known classes, detects, and manages things that it does not know, and learns them over time from a non-stationary stream of data. Open-world learning is related to but also distinct from a multitude of other learning problems, and this paper briefly analyzes the key differences between a wide range of problems including incremental learning, generalized novelty discovery, and generalized zero-shot learning. This paper formalizes various open-world learning problems, including open-world learning without labels. These open-world problems can be addressed with modifications to known elements. We present a new framework that enables agents to combine various modules for novelty detection, novelty characterization, incremental learning, and instance management to learn new classes from a stream of unlabeled data in an unsupervised manner; we survey how to adapt a few state-of-the-art techniques to fit the framework and use them to define seven baselines for performance on the open-world learning without labels problem. We then discuss open-world learning quality and analyze how that can improve instance management. We also discuss some of the general ambiguity issues that occur in open-world learning without labels.
    Latent structure blockmodels for Bayesian spectral graph clustering. (arXiv:2107.01734v2 [stat.ML] UPDATED)
    (0 min) Spectral embedding of network adjacency matrices often produces node representations living approximately around low-dimensional submanifold structures. In particular, hidden substructure is expected to arise when the graph is generated from a latent position model. Furthermore, the presence of communities within the network might generate community-specific submanifold structures in the embedding, but this is not explicitly accounted for in most statistical models for networks. In this article, a class of models called latent structure block models (LSBM) is proposed to address such scenarios, allowing for graph clustering when community-specific one-dimensional manifold structure is present. LSBMs focus on a specific class of latent space model, the random dot product graph (RDPG), and assign a latent submanifold to the latent positions of each community. A Bayesian model for the embeddings arising from LSBMs is discussed, and shown to have a good performance on simulated and real world network data. The model is able to correctly recover the underlying communities living in a one-dimensional manifold, even when the parametric form of the underlying curves is unknown, achieving remarkable results on a variety of real data.
    Distributed Evolution Strategies Using TPUs for Meta-Learning. (arXiv:2201.00093v1 [cs.NE])
    (0 min) Meta-learning traditionally relies on backpropagation through entire tasks to iteratively improve a model's learning dynamics. However, this approach is computationally intractable when scaled to complex tasks. We propose a distributed evolutionary meta-learning strategy using Tensor Processing Units (TPUs) that is highly parallel and scalable to arbitrarily long tasks with no increase in memory cost. Using a Prototypical Network trained with evolution strategies on the Omniglot dataset, we achieved an accuracy of 98.4% on a 5-shot classification problem. Our algorithm used as much as 40 times less memory than automatic differentiation to compute the gradient, with the resulting model achieving accuracy within 1.3% of a backpropagation-trained equivalent (99.6%). We observed better classification accuracy as high as 99.1% with larger population configurations. We further experimentally validate the stability and performance of ES-ProtoNet across a variety of training conditions (varying population size, model size, number of workers, shot, way, ES hyperparameters, etc.). Our contributions are twofold: we provide the first assessment of evolutionary meta-learning in a supervised setting, and create a general framework for distributed evolution strategies on TPUs.
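    The constant-memory property comes from the evolution-strategies gradient estimator itself, which needs only fitness evaluations of perturbed parameters, never backpropagation through the task. A single (OpenAI-style) update step might look like the sketch below; in the distributed setting each TPU core would evaluate a slice of the population. Hyperparameters and the fitness interface are assumptions:

        import numpy as np

        def es_step(theta, fitness, lr=0.05, sigma=0.1, pop=200, rng=np.random):
            # Evaluate Gaussian perturbations of the flat parameter vector.
            noise = rng.randn(pop, theta.size)
            scores = np.array([fitness(theta + sigma * n) for n in noise])
            # Rank-free standardization keeps the update scale-invariant.
            scores = (scores - scores.mean()) / (scores.std() + 1e-8)
            # Monte Carlo estimate of the gradient of expected fitness.
            grad = noise.T @ scores / (pop * sigma)
            return theta + lr * grad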
    Dynamic Persistent Homology for Brain Networks via Wasserstein Graph Clustering. (arXiv:2201.00087v1 [math.AT])
    (0 min) We present a novel Wasserstein graph clustering method for dynamically changing graphs. Wasserstein clustering penalizes the topological discrepancy between graphs and is shown to outperform the widely used k-means clustering. The method is applied to more accurately determine the state spaces of dynamically changing functional brain networks.
    AutoGCL: Automated Graph Contrastive Learning via Learnable View Generators. (arXiv:2109.10259v2 [cs.LG] UPDATED)
    (0 min) Contrastive learning has been widely applied to graph representation learning, where the view generators play a vital role in generating effective contrastive samples. Most of the existing contrastive learning methods employ pre-defined view generation methods, e.g., node drop or edge perturbation, which usually cannot adapt to input data or preserve the original semantic structures well. To address this issue, we propose a novel framework named Automated Graph Contrastive Learning (AutoGCL) in this paper. Specifically, AutoGCL employs a set of learnable graph view generators orchestrated by an auto augmentation strategy, where every graph view generator learns a probability distribution of graphs conditioned by the input. While the graph view generators in AutoGCL preserve the most representative structures of the original graph in generation of every contrastive sample, the auto augmentation learns policies to introduce adequate augmentation variances in the whole contrastive learning procedure. Furthermore, AutoGCL adopts a joint training strategy to train the learnable view generators, the graph encoder, and the classifier in an end-to-end manner, resulting in topological heterogeneity yet semantic similarity in the generation of contrastive samples. Extensive experiments on semi-supervised learning, unsupervised learning, and transfer learning demonstrate the superiority of our AutoGCL framework over the state-of-the-arts in graph contrastive learning. In addition, the visualization results further confirm that the learnable view generators can deliver more compact and semantically meaningful contrastive samples compared against the existing view generation methods.
    Learning shared neural manifolds from multi-subject FMRI data. (arXiv:2201.00622v1 [q-bio.NC])
    (0 min) Functional magnetic resonance imaging (fMRI) is a notoriously noisy measurement of brain activity because of the large variations between individuals, signals marred by environmental differences during collection, and the spatiotemporal averaging required by the measurement resolution. In addition, the data is extremely high-dimensional, while the underlying activity typically has much lower intrinsic dimension. In order to understand the connection between stimuli of interest and brain activity, and to analyze differences and commonalities between subjects, it becomes important to learn a meaningful embedding that denoises the data and reveals its intrinsic structure. Specifically, we assume that while noise varies significantly between individuals, true responses to stimuli share common, low-dimensional features that are jointly discoverable across subjects. Similar approaches have been exploited previously, but mainly with linear methods such as PCA and shared response modeling (SRM). In contrast, we propose a neural network called MRMD-AE (manifold-regularized multiple-decoder autoencoder) that learns a common embedding from multiple subjects in an experiment while retaining the ability to decode to individual raw fMRI signals. We show that our learned common space represents an extensible manifold (where new points not seen during training can be mapped), improves the classification accuracy of stimulus features at unseen timepoints, and improves cross-subject translation of fMRI signals. We believe this framework can be used for many downstream applications such as guided brain-computer interface (BCI) training in the future.
    On the convex hull of convex quadratic optimization problems with indicators. (arXiv:2201.00387v1 [math.OC])
    (0 min) We consider the convex quadratic optimization problem with indicator variables and arbitrary constraints on the indicators. We show that a convex hull description of the associated mixed-integer set in an extended space with a quadratic number of additional variables consists of a single positive semidefinite constraint (explicitly stated) and linear constraints. In particular, convexification of this class of problems reduces to describing a polyhedral set in an extended formulation. We also give descriptions in the original space of variables: we provide a description based on an infinite number of conic-quadratic inequalities, which are "finitely generated." In particular, it is possible to characterize whether a given inequality is necessary to describe the convex hull. The new theory presented here unifies several previously established results, and paves the way toward utilizing polyhedral methods to analyze the convex hull of mixed-integer nonlinear sets.
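    For context, a common form of this problem class ties each continuous variable to a binary indicator (the paper additionally allows arbitrary constraints on the indicators); as a hedged sketch:

        $\min_{x \in \mathbb{R}^n,\, z \in \{0,1\}^n} \; x^\top Q x + c^\top x \quad \text{s.t.} \quad x_i (1 - z_i) = 0, \; i = 1, \dots, n, \; z \in Z,$

    with $Q \succeq 0$, so that $x_i$ can be nonzero only when its indicator $z_i = 1$, and $Z$ encoding the arbitrary indicator constraints.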
    Applications of Gaussian Mutation for Self Adaptation in Evolutionary Genetic Algorithms. (arXiv:2201.00285v1 [cs.NE])
    (0 min) In recent years, optimization problems have become increasingly prevalent, driving the need for more powerful computational methods. With the more recent advent of technologies such as artificial intelligence, new metaheuristics are needed that enhance the capabilities of classical algorithms. More recently, researchers have looked to Charles Darwin's theory of natural selection and evolution as a means of enhancing current approaches using machine learning. In the 1960s, the first genetic algorithms were developed by John H. Holland and his students. We explore the mathematical intuition of the genetic algorithm in developing systems capable of evolving using Gaussian mutation, as well as its implications for solving optimization problems.
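    A minimal NumPy sketch of self-adaptive Gaussian mutation in the evolution-strategies style, where each individual carries its own step size that is itself mutated log-normally (names and constants are illustrative, not from the paper):

        import numpy as np

        rng = np.random.default_rng(1)

        def gaussian_mutate(x, sigma, tau=None):
            # Self-adaptation: mutate the step size first, then the genome.
            tau = tau or 1.0 / np.sqrt(len(x))
            new_sigma = sigma * np.exp(tau * rng.standard_normal())
            new_x = x + new_sigma * rng.standard_normal(len(x))
            return new_x, new_sigma

        # (1+1)-ES on the sphere function: keep the child only if it improves.
        x, sigma = rng.standard_normal(10), 0.5
        for _ in range(500):
            child, child_sigma = gaussian_mutate(x, sigma)
            if np.sum(child ** 2) < np.sum(x ** 2):
                x, sigma = child, child_sigma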
    Robust Natural Language Processing: Recent Advances, Challenges, and Future Directions. (arXiv:2201.00768v1 [cs.CL])
    (0 min) Recent natural language processing (NLP) techniques have accomplished high performance on benchmark datasets, primarily due to the significant improvement in the performance of deep learning. The advances in the research community have led to great enhancements in state-of-the-art production systems for NLP tasks, such as virtual assistants, speech recognition, and sentiment analysis. However, such NLP systems still often fail when tested with adversarial attacks. The initial lack of robustness exposed troubling gaps in current models' language understanding capabilities, creating problems when NLP systems are deployed in real life. In this paper, we present a structured overview of NLP robustness research by summarizing the literature in a systematic way across various dimensions. We then take a deep dive into the various dimensions of robustness, across techniques, metrics, embeddings, and benchmarks. Finally, we argue that robustness should be multi-dimensional, provide insights into current research, and identify gaps in the literature, suggesting directions worth pursuing to address them.
    Statistical and Topological Properties of Sliced Probability Divergences. (arXiv:2003.05783v2 [stat.ML] UPDATED)
    (0 min) The idea of slicing divergences has proven successful when comparing two probability measures in various machine learning applications, including generative modeling, and consists of computing the expected value of a 'base divergence' between one-dimensional random projections of the two measures. However, the topological, statistical, and computational consequences of this technique have not yet been well established. In this paper, we aim to bridge this gap and derive various theoretical properties of sliced probability divergences. First, we show that slicing preserves the metric axioms and the weak continuity of the divergence, implying that the sliced divergence shares similar topological properties. We then make these results precise in the case where the base divergence belongs to the class of integral probability metrics. We also establish that, under mild conditions, the sample complexity of a sliced divergence does not depend on the problem dimension. We finally apply our general results to several base divergences, and illustrate our theory on both synthetic and real data experiments.
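    The slicing construction itself is short enough to state in code. A hedged NumPy sketch of the Monte-Carlo sliced 1-Wasserstein distance (assuming equal sample sizes, where the 1-D Wasserstein distance reduces to comparing sorted projections):

        import numpy as np

        def sliced_wasserstein(X, Y, n_projections=100, rng=None):
            # X, Y: (n, d) samples from the two measures being compared.
            rng = rng or np.random.default_rng(0)
            d, total = X.shape[1], 0.0
            for _ in range(n_projections):
                theta = rng.standard_normal(d)
                theta /= np.linalg.norm(theta)        # random direction on the sphere
                px, py = np.sort(X @ theta), np.sort(Y @ theta)
                total += np.mean(np.abs(px - py))     # 1-D W1 via quantile matching
            return total / n_projections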
    Elsa: Energy-based learning for semi-supervised anomaly detection. (arXiv:2103.15296v2 [cs.CV] UPDATED)
    (0 min) Anomaly detection aims at identifying instances that deviate from the normal data distribution. Many advances have been made in the field, including the innovative use of unsupervised contrastive learning. However, existing methods generally assume clean training data and are limited when the data contain unknown anomalies. This paper presents Elsa, a novel semi-supervised anomaly detection approach that unifies the concept of energy-based models with unsupervised contrastive learning. Elsa instills robustness against data contamination via a carefully designed fine-tuning step based on a new energy function that forces the normal data to be divided into classes of prototypes. Experiments on multiple contamination scenarios show the proposed model achieves SOTA performance. Extensive analyses also verify the contribution of each component in the proposed model. Beyond the experiments, we also offer a theoretical interpretation of why contrastive learning alone cannot detect anomalies under data contamination.
    Risk Bounds for Over-parameterized Maximum Margin Classification on Sub-Gaussian Mixtures. (arXiv:2104.13628v2 [cs.LG] UPDATED)
    (0 min) Modern machine learning systems such as deep neural networks are often highly over-parameterized so that they can fit the noisy training data exactly, yet they can still achieve small test errors in practice. In this paper, we study this "benign overfitting" phenomenon of the maximum margin classifier for linear classification problems. Specifically, we consider data generated from sub-Gaussian mixtures, and provide a tight risk bound for the maximum margin linear classifier in the over-parameterized setting. Our results precisely characterize the condition under which benign overfitting can occur in linear classification problems, and improve on previous work. They also have direct implications for over-parameterized logistic regression.
    Topic Analysis of Superconductivity Literature by Semantic Non-negative Matrix Factorization. (arXiv:2201.00687v1 [cs.DL])
    (0 min) We utilize a recently developed topic modeling method called SeNMFk, which extends standard Non-negative Matrix Factorization (NMF) methods by incorporating the semantic structure of the text and adding a robust system for determining the number of topics. With SeNMFk, we were able to extract coherent topics validated by human experts. Of these topics, a few are relatively general and cover broad concepts, while the majority can be precisely mapped to specific scientific effects or measurement techniques. The topics also differ in ubiquity: only three topics are prevalent in almost 40 percent of the abstracts, while each specific topic tends to dominate a small subset of the abstracts. These results demonstrate the ability of SeNMFk to produce a layered and nuanced analysis of large scientific corpora.
    Using Non-Stationary Bandits for Learning in Repeated Cournot Games with Non-Stationary Demand. (arXiv:2201.00486v1 [cs.LG])
    (0 min) Many past attempts at modeling repeated Cournot games assume that demand is stationary. This does not align with real-world scenarios in which market demand can evolve over a product's lifetime for a myriad of reasons. In this paper, we model repeated Cournot games with non-stationary demand such that firms/agents face separate instances of a non-stationary multi-armed bandit problem. The set of arms/actions that an agent can choose from represents discrete production quantities; here, the action space is ordered. Agents are independent and autonomous, and cannot observe anything from the environment; they can only see their own rewards after taking an action, and work solely towards maximizing these rewards. We propose a novel algorithm, 'Adaptive with Weighted Exploration (AWE) $\epsilon$-greedy', which is loosely based on the well-known $\epsilon$-greedy approach. This algorithm detects and quantifies changes in rewards due to varying market demand and varies the learning rate and exploration rate in proportion to the degree of change in demand, thus enabling agents to better identify new optimal actions. For efficient exploration, it also deploys a mechanism for weighing actions that takes advantage of the ordered action space. We use simulations to study the emergence of various equilibria in the market. In addition, we study the scalability of our approach in terms of the total number of agents in the system and the size of the action space. We consider both symmetric and asymmetric firms in our models. We find that, using our proposed method, agents are able to swiftly change their course of action according to changes in demand, and that they also engage in collusive behavior in many simulations.
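    The paper's AWE algorithm is not spelled out in the abstract, so the following is only a generic change-aware epsilon-greedy sketch in NumPy: exploration is boosted in proportion to the drift between recent and older reward windows, which is the mechanism the abstract describes.

        import numpy as np

        def adaptive_eps_greedy(pull, n_arms, steps, window=50, eps_min=0.02, rng=None):
            # pull(arm) -> reward; a stand-in for a firm choosing production quantities.
            rng = rng or np.random.default_rng(0)
            q, counts, history, eps = np.zeros(n_arms), np.zeros(n_arms), [], 1.0
            for _ in range(steps):
                arm = rng.integers(n_arms) if rng.random() < eps else int(np.argmax(q))
                r = pull(arm)
                counts[arm] += 1
                q[arm] += (r - q[arm]) / counts[arm]      # incremental mean estimate
                history.append(r)
                if len(history) >= 2 * window:
                    drift = abs(np.mean(history[-window:])
                                - np.mean(history[-2 * window:-window]))
                    eps = min(1.0, max(eps_min, drift))   # more drift -> more exploration
            return q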
    Fair Data Representation for Machine Learning at the Pareto Frontier. (arXiv:2201.00292v1 [stat.ML])
    (0 min) As machine learning-powered decision making plays an increasingly important role in our daily lives, it is imperative to strive for fairness in the underlying data processing and algorithms. We propose a pre-processing algorithm for fair data representation via which L2-objective supervised learning algorithms result in an estimation of the Pareto frontier between prediction error and statistical disparity. In particular, the present work applies the optimal positive definite affine transport maps to approach the post-processing Wasserstein barycenter characterization of the optimal fair L2-objective supervised learning via a pre-processing data deformation. We call the resulting data the Wasserstein pseudo-barycenter. Furthermore, we show that the Wasserstein geodesics from the learning outcome marginals to the barycenter characterize the Pareto frontier between L2-loss and the total Wasserstein distance among learning outcome marginals. Thereby, an application of McCann interpolation generalizes the pseudo-barycenter to a family of data representations via which L2-objective supervised learning algorithms trace out the Pareto frontier. Numerical simulations underscore the advantages of the proposed data representation: (1) the pre-processing step is composable with arbitrary L2-objective supervised learning methods and unseen data; (2) the fair representation protects data privacy by preventing the training machine from direct or indirect access to the sensitive information in the data; (3) the optimal affine map results in efficient computation of fair supervised learning on high-dimensional data; (4) experimental results shed light on the fairness of L2-objective unsupervised learning via the proposed fair data representation.
    Novelty-based Generalization Evaluation for Traffic Light Detection. (arXiv:2201.00531v1 [cs.CV])
    (0 min) The advent of Convolutional Neural Networks (CNNs) has led to their application in several domains. One noteworthy application is the perception system for autonomous driving that relies on the predictions from CNNs. Practitioners evaluate the generalization ability of such CNNs by calculating various metrics on an independent test dataset. A test dataset is often chosen based on only one precondition, i.e., that its elements are not part of the training data. Such a dataset may contain objects that are both similar and novel w.r.t. the training dataset. Nevertheless, existing works do not account for the novelty of the test samples and treat them all equally when evaluating generalization. Such novelty-aware evaluation is significant for validating the fitness of a CNN in autonomous driving applications. Hence, we propose a CNN generalization scoring framework that considers the novelty of objects in the test dataset. We begin with a representation learning technique that reduces the image data to a low-dimensional space, in which we then estimate the novelty of the test samples. Finally, we calculate the generalization score as a combination of test data prediction performance and novelty. We perform an experimental study of this framework on our traffic light detection application. In addition, we systematically visualize the results for an interpretable notion of novelty.
    Rethinking Feature Uncertainty in Stochastic Neural Networks for Adversarial Robustness. (arXiv:2201.00148v1 [cs.LG])
    (2 min) It is well known that deep neural networks (DNNs) have shown remarkable success in many fields. However, when an imperceptible perturbation is added to the model input, model performance can degrade rapidly. To address this issue, a randomization technique named Stochastic Neural Networks (SNNs) has recently been proposed. Specifically, SNNs inject randomness into the model to defend against unseen attacks and improve adversarial robustness. However, existing studies on SNNs mainly focus on injecting fixed or learnable noise into model weights/activations. In this paper, we find that the performance of existing SNNs is largely bottlenecked by their feature representation ability. Surprisingly, simply maximizing the per-dimension variance of the feature distribution leads to a considerable boost beyond all previous methods; we name this approach the maximize-feature-distribution-variance stochastic neural network (MFDV-SNN). Extensive experiments on well-known white- and black-box attacks show that MFDV-SNN achieves a significant improvement over existing methods, indicating that it is a simple but effective way to improve model robustness.
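    The core idea admits a very small sketch. Below is a hedged PyTorch version of a variance-maximizing objective in the spirit of MFDV-SNN (the exact loss form is assumed, not quoted from the paper): cross-entropy minus the mean per-dimension variance of the batch features.

        import torch
        import torch.nn.functional as F

        def mfdv_loss(logits, features, labels, lam=0.1):
            # Encourage the feature distribution to spread out along every dimension.
            ce = F.cross_entropy(logits, labels)
            var_per_dim = features.var(dim=0)   # variance across the batch, per feature
            return ce - lam * var_per_dim.mean()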
    Stochastic Weight Averaging Revisited. (arXiv:2201.00519v1 [cs.LG])
    (0 min) Stochastic weight averaging (SWA) is recognized as a simple yet effective approach for improving the generalization of stochastic gradient descent (SGD) when training deep neural networks (DNNs). A common explanation for its success is that averaging weights along an SGD trajectory run with cyclical or high constant learning rates discovers wider optima, which in turn generalize better. We offer a new insight that does not concur with this view. We show that SWA's performance depends heavily on how far the SGD process that precedes it has converged, and that the weight-averaging operation contributes only variance reduction. This new insight suggests practical guidance for better algorithm design. As an instantiation, we show that, following an SGD process with insufficient convergence, running SWA repeatedly yields continual incremental benefits in generalization. Our findings are corroborated by extensive experiments across different network architectures, including a baseline CNN, PreResNet-164, WideResNet-28-10, VGG16, ResNet-50, ResNet-152, and DenseNet-161, and different datasets, including CIFAR-{10,100} and ImageNet.
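    A minimal SWA recipe using PyTorch's built-in utilities; per the paper's finding, starting the averaging only after SGD has (nearly) converged matters, since the averaging itself mainly contributes variance reduction (hyperparameters below are illustrative):

        import torch
        from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

        def train_with_swa(model, loader, loss_fn, epochs=100, swa_start=75):
            opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
            swa_model = AveragedModel(model)        # running average of weights
            swa_sched = SWALR(opt, swa_lr=0.05)     # high constant lr while averaging
            for epoch in range(epochs):
                for x, y in loader:
                    opt.zero_grad()
                    loss_fn(model(x), y).backward()
                    opt.step()
                if epoch >= swa_start:
                    swa_model.update_parameters(model)
                    swa_sched.step()
            update_bn(loader, swa_model)            # recompute BatchNorm statistics
            return swa_model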
    Minimum Excess Risk in Bayesian Learning. (arXiv:2012.14868v2 [cs.LG] UPDATED)
    (0 min) We analyze the best achievable performance of Bayesian learning under generative models by defining and upper-bounding the minimum excess risk (MER): the gap between the minimum expected loss attainable by learning from data and the minimum expected loss that could be achieved if the model realization were known. The definition of MER provides a principled way to define different notions of uncertainties in Bayesian learning, including the aleatoric uncertainty and the minimum epistemic uncertainty. Two methods for deriving upper bounds for the MER are presented. The first method, generally suitable for Bayesian learning with a parametric generative model, upper-bounds the MER by the conditional mutual information between the model parameters and the quantity being predicted given the observed data. It allows us to quantify the rate at which the MER decays to zero as more data becomes available. Under realizable models, this method also relates the MER to the richness of the generative function class, notably the VC dimension in binary classification. The second method, particularly suitable for Bayesian learning with a parametric predictive model, relates the MER to the minimum estimation error of the model parameters from data. It explicitly shows how the uncertainty in model parameter estimation translates to the MER and to the final prediction uncertainty. We also extend the definition and analysis of MER to the setting with multiple model families and the setting with nonparametric models. Along the way, we draw some comparisons between the MER in Bayesian learning and the excess risk in frequentist learning.
    A Mixed Integer Programming Approach to Training Dense Neural Networks. (arXiv:2201.00723v1 [cs.LG])
    (0 min) Artificial Neural Networks (ANNs) are prevalent machine learning models that have been applied across various real-world classification tasks. ANNs require a large amount of data to have strong out-of-sample performance, and many algorithms for training ANN parameters are based on stochastic gradient descent (SGD). However, the SGD ANNs that tend to perform best on prediction tasks are trained in an end-to-end manner that requires a large number of model parameters and random initialization. This makes training ANNs very time-consuming, and the resulting models require substantial memory to deploy. In order to train more parsimonious ANN models, we propose the use of alternative methods from the constrained optimization literature for ANN training and pretraining. In particular, we propose novel mixed integer programming (MIP) formulations for training fully-connected ANNs. Our formulations can account for both binary-activation and rectified linear unit (ReLU) activation ANNs, and for the use of a log-likelihood loss. We also develop a layer-wise greedy approach, a technique adapted for reducing the number of layers in the ANN, for model pretraining using our MIP formulations. We then present numerical experiments comparing our MIP-based methods against existing SGD-based approaches and show that we are able to achieve models with competitive out-of-sample performance that are significantly more parsimonious.
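    The paper's exact formulations are not given in the abstract, but one standard way to encode a ReLU unit $y = \max(0, w^\top x + b)$ in a MIP, given finite bounds $L \le w^\top x + b \le U$ and a binary activation indicator $z$, is the big-M system

        $y \ge w^\top x + b, \qquad y \ge 0, \qquad y \le w^\top x + b - L(1 - z), \qquad y \le U z, \qquad z \in \{0, 1\},$

    so that $z = 1$ forces $y = w^\top x + b$ and $z = 0$ forces $y = 0$. Stacking such systems over all neurons yields a trainable MIP of the kind the paper studies.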
    On Sensitivity of Deep Learning Based Text Classification Algorithms to Practical Input Perturbations. (arXiv:2201.00318v1 [cs.CL])
    (0 min) Text classification is a fundamental Natural Language Processing task with a wide variety of applications, where deep learning approaches have produced state-of-the-art results. Beyond the frequent criticism of these models' black-box nature, their robustness to slight perturbations in input text has also been a matter of concern. In this work, we carry out a data-focused study evaluating the impact of systematic practical perturbations on the performance of deep learning-based text classification models, including CNN-, LSTM-, and BERT-based algorithms. The perturbations are induced by the addition and removal of unwanted tokens, such as punctuation and stop-words, that are minimally associated with the final performance of the model. We show that these deep learning approaches, including BERT, are sensitive to such legitimate input perturbations on four standard benchmark datasets: SST2, TREC-6, BBC News, and tweet_eval. We observe that BERT is more susceptible to the removal of tokens than to their addition. Moreover, the LSTM is slightly more sensitive to input perturbations than the CNN-based model. The work also serves as a practical guide to assessing the impact of discrepancies in train-test conditions on the final performance of models.
    SAFL: A Self-Attention Scene Text Recognizer with Focal Loss. (arXiv:2201.00132v1 [cs.CV])
    (2 min) In recent decades, scene text recognition has gained worldwide attention from both the academic community and actual users due to its importance in a wide range of applications. Despite achievements in optical character recognition, scene text recognition remains challenging due to inherent problems such as distortions and irregular layout. Most existing approaches mainly leverage recurrence- or convolution-based neural networks. However, while recurrent neural networks (RNNs) usually suffer from slow training due to sequential computation and encounter problems such as vanishing gradients or information bottlenecks, CNNs endure a trade-off between complexity and performance. In this paper, we introduce SAFL, a self-attention-based neural network model with focal loss for scene text recognition, to overcome the limitations of existing approaches. The use of focal loss instead of negative log-likelihood helps the model focus more on low-frequency samples during training. Moreover, to deal with distortions and irregular text, we exploit a Spatial Transformer Network (STN) to rectify text before passing it to the recognition network. We perform experiments comparing the performance of the proposed model against seven benchmarks. The numerical results show that our model achieves the best performance.
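    Focal loss itself is a one-liner on top of the softmax. A hedged PyTorch sketch of the multi-class form $\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log p_t$, which down-weights easy, high-confidence samples relative to plain negative log-likelihood:

        import torch
        import torch.nn.functional as F

        def focal_loss(logits, targets, gamma=2.0):
            log_p = F.log_softmax(logits, dim=-1)
            log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # true-class log-prob
            pt = log_pt.exp()
            return (-(1 - pt) ** gamma * log_pt).mean()  # gamma=0 recovers plain NLL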
    Support vector machines and Radon's theorem. (arXiv:2011.00617v2 [cs.LG] UPDATED)
    (2 min) A support vector machine (SVM) is an algorithm which finds a hyperplane that optimally separates labeled data points in $\mathbb{R}^n$ into positive and negative classes. The data points on the margin of this separating hyperplane are called support vectors. We connect the possible configurations of support vectors to Radon's theorem, which provides guarantees for when a set of points can be divided into two classes (positive and negative) whose convex hulls intersect. If the convex hulls of the positive and negative support vectors are projected onto a separating hyperplane, the projections intersect in at least one point if and only if the hyperplane is optimal. Further, with a particular type of general position, we show that (a) the projected convex hulls of the support vectors intersect in exactly one point, (b) the support vectors are stable under perturbation, (c) there are at most $n+1$ support vectors, and (d) every number of support vectors from 2 up to $n+1$ is possible. Finally, we perform computer simulations studying the expected number of support vectors, and their configurations, for randomly generated data. We observe that as the distance between classes of points increases for this type of randomly generated data, configurations with two support vectors become the most likely configurations.
    Continuity of Generalized Entropy and Statistical Learning. (arXiv:2012.15829v2 [cs.LG] UPDATED)
    (2 min) We study the continuity property of the generalized entropy as a function of the underlying probability distribution, defined with an action space and a loss function, and use this property to answer the basic questions in statistical learning theory: the excess risk analyses for various learning methods. We first derive upper and lower bounds for the entropy difference of two distributions in terms of several commonly used f-divergences, the Wasserstein distance, a distance that depends on the action space and the loss function, and the Bregman divergence generated by the entropy, which also induces bounds in terms of the Euclidean distance between the two distributions. Examples are given along with the discussion of each general result, comparisons are made with the existing entropy difference bounds, and new mutual information upper bounds are derived based on the new results. We then apply the entropy difference bounds to the theory of statistical learning. It is shown that the excess risks in the two popular learning paradigms, the frequentist learning and the Bayesian learning, both can be studied with the continuity property of different forms of the generalized entropy. The analysis is then extended to the continuity of generalized conditional entropy. The extension provides performance bounds for Bayes decision making with mismatched distributions. It also leads to excess risk bounds for a third paradigm of learning, where the decision rule is optimally designed under the projection of the empirical distribution to a predefined family of distributions. We thus establish a unified method of excess risk analysis for the three major paradigms of statistical learning, through the continuity of generalized entropy.
    Concept Embeddings for Fuzzy Logic Verification of Deep Neural Networks in Perception Tasks. (arXiv:2201.00572v1 [cs.CV])
    (2 min) One major drawback of deep neural networks (DNNs) for use in sensitive application domains is their black-box nature. This makes it hard to verify or monitor complex, symbolic requirements. In this work, we present a simple, yet effective, approach to verify whether a trained convolutional neural network (CNN) respects specified symbolic background knowledge. The knowledge may consist of any fuzzy predicate logic rules. For this, we utilize methods from explainable artificial intelligence (XAI): first, using concept embedding analysis, the output of a computer vision CNN is post-hoc enriched by concept outputs; second, logical rules from prior knowledge are fuzzified to serve as continuous-valued functions on the concept outputs. These can be evaluated with little computational overhead. We demonstrate three diverse use-cases of our method on state-of-the-art object detectors: finding corner cases, utilizing the rules for detecting and localizing DNN misbehavior at runtime, and comparing the logical consistency of DNNs. The latter is used to find related differences between the EfficientDet D1 and Mask R-CNN object detectors. We show that this approach benefits from fuzziness and from calibrating the concept outputs.
    Toward Causal-Aware RL: State-Wise Action-Refined Temporal Difference. (arXiv:2201.00354v1 [cs.LG])
    (2 min) Although it is well known that exploration plays a key role in Reinforcement Learning (RL), prevailing exploration strategies for continuous control tasks in RL are mainly based on naive isotropic Gaussian noise, regardless of the causal relationship between the action space and the task, and treat all action dimensions as equally important. In this work, we propose to conduct interventions on the primal action space to discover the causal relationship between the action space and the task reward. We propose the method of State-Wise Action Refined (SWAR), which addresses the issue of action space redundancy and promotes causality discovery in RL. We formulate causality discovery in RL tasks as a state-dependent action space selection problem and propose two practical algorithms as solutions. The first approach, TD-SWAR, detects task-related actions during temporal difference learning, while the second approach, Dyn-SWAR, reveals important actions through dynamic model prediction. Empirically, both methods provide ways to understand the decisions made by RL agents and improve learning efficiency in action-redundant tasks.
    The complementarity of a diverse range of deep learning features extracted from video content for video recommendation. (arXiv:2011.10834v2 [cs.IR] UPDATED)
    (2 min) Following the popularisation of media streaming, a number of video streaming services are continuously buying new video content to mine its potential profit. As such, newly added content has to be handled well to be recommended to suitable users. In this paper, we address the new item cold-start problem by exploring the potential of various deep learning features to provide video recommendations. The deep learning features investigated include features that capture the visual appearance, audio, and motion information of video content. We also explore different fusion methods to evaluate how well these feature modalities can be combined to fully exploit the complementary information they capture. Experiments on a real-world video dataset for movie recommendations show that deep learning features outperform hand-crafted features. In particular, recommendations generated with deep learning audio features and action-centric deep learning features are superior to MFCC and state-of-the-art iDT features. In addition, combining various deep learning features with hand-crafted features and textual metadata yields significant improvement in recommendations compared to combining only the former.
    Operator Deep Q-Learning: Zero-Shot Reward Transferring in Reinforcement Learning. (arXiv:2201.00236v1 [cs.LG])
    (2 min) Reinforcement learning (RL) has drawn increasing interest in recent years due to its tremendous success in various applications. However, standard RL algorithms can only be applied to a single reward function and cannot adapt to an unseen reward function quickly. In this paper, we advocate a general operator view of reinforcement learning, which enables us to directly approximate the operator that maps from reward function to value function. The benefit of learning the operator is that we can incorporate any new reward function as input and attain its corresponding value function in a zero-shot manner. To approximate this special type of operator, we design a number of novel operator neural network architectures based on its theoretical properties. Our design of operator networks outperforms existing methods and the standard design of general-purpose operator networks, and we demonstrate the benefit of our operator deep Q-learning framework on several tasks, including reward transfer for offline policy evaluation (OPE) and reward transfer for offline policy optimization, across a range of tasks.
    Multi-Dimensional Model Compression of Vision Transformer. (arXiv:2201.00043v1 [cs.CV])
    (2 min) Vision transformers (ViT) have recently attracted considerable attention, but their huge computational cost remains an issue for practical deployment. Previous ViT pruning methods tend to prune the model along a single dimension, which may cause excessive reduction and lead to sub-optimal model quality. In contrast, we advocate a multi-dimensional ViT compression paradigm and propose to harness redundancy reduction across the attention head, neuron, and sequence dimensions jointly. We first propose a statistical dependence-based pruning criterion that is generalizable to different dimensions for identifying deleterious components. Moreover, we cast the multi-dimensional compression as an optimization problem, learning the optimal pruning policy across the three dimensions that maximizes the compressed model's accuracy under a computational budget. The problem is solved by our adapted Gaussian process search with expected improvement. Experimental results show that our method effectively reduces the computational cost of various ViT models. For example, our method reduces FLOPs by 40% without top-1 accuracy loss for DeiT and T2T-ViT models, outperforming previous state-of-the-art methods.
    Application of Machine Learning Methods in Inferring Surface Water Groundwater Exchanges using High Temporal Resolution Temperature Measurements. (arXiv:2201.00726v1 [cs.LG])
    (2 min) We examine the ability of machine learning (ML) and deep learning (DL) algorithms to infer surface/ground exchange flux based on subsurface temperature observations. The observations and fluxes are produced from a high-resolution numerical model representing conditions in the Columbia River near the Department of Energy Hanford site located in southeastern Washington State. Random measurement error, of varying magnitude, is added to the synthetic temperature observations. The results indicate that both ML and DL methods can be used to infer the surface/ground exchange flux. DL methods, especially convolutional neural networks, outperform the ML methods when used to interpret noisy temperature data with a smoothing filter applied. However, the ML methods also performed well, and they can better identify a reduced number of important observations, which could be useful for measurement network optimization. Surprisingly, the ML and DL methods better inferred upward flux than downward flux. This is in direct contrast to previous findings that used numerical models to infer flux from temperature observations, and it may suggest that combining ML or DL inference with numerical inference could improve flux estimation beneath river systems.
    'Moving On' -- Investigating Inventors' Ethnic Origins Using Supervised Learning. (arXiv:2201.00578v1 [econ.GN])
    (2 min) Patent data provides rich information about technical inventions, but does not disclose the ethnic origin of inventors. In this paper, I use supervised learning techniques to infer this information. To do so, I construct a dataset of 95,202 labeled names and train an artificial recurrent neural network with long short-term memory (LSTM) to predict ethnic origins based on names. The trained network achieves an overall performance of 91% across 17 ethnic origins. I use this model to classify and investigate the ethnic origins of 2.68 million inventors and provide novel descriptive evidence regarding their ethnic origin composition over time and across countries and technological fields. The global ethnic origin composition has become more diverse over the last decades, mostly due to a relative increase in inventors of Asian origin. Furthermore, the prevalence of foreign-origin inventors is especially high in the USA, but has also increased in other high-income economies. This increase was mainly driven by an inflow of non-western inventors into emerging high-technology fields in the USA, but not in other high-income countries.
    Semi-supervised Stance Detection of Tweets Via Distant Network Supervision. (arXiv:2201.00614v1 [cs.SI])
    (2 min) Detecting and labeling stance in social media text is strongly motivated by hate speech detection, poll prediction, engagement forecasting, and concerted propaganda detection. Today's best neural stance detectors need large volumes of training data, which is difficult to curate given the fast-changing landscape of social media text and the issues on which users opine. Homophily properties over the social network provide a strong signal of coarse-grained user-level stance. But semi-supervised approaches for tweet-level stance detection fail to properly leverage homophily. In light of this, we present SANDS, a new semi-supervised stance detector. SANDS starts from very few labeled tweets. It builds multiple deep feature views of tweets and also uses a distant supervision signal from the social network to provide a surrogate loss signal to the component learners. We prepare two new tweet datasets comprising over 236,000 politically tinted tweets from two demographics (US and India) posted by over 87,000 users, their follower-followee graph, and over 8,000 tweets annotated by linguists. SANDS achieves a macro-F1 score of 0.55 (0.49) on US (India)-based datasets, outperforming 17 baselines (including variants of SANDS) substantially, particularly for minority stance labels and noisy text. Numerous ablation experiments on SANDS disentangle the dynamics of textual and network-propagated stance signals.
    Adaptive Memory Networks with Self-supervised Learning for Unsupervised Anomaly Detection. (arXiv:2201.00464v1 [cs.LG])
    (2 min) Unsupervised anomaly detection aims to build models that effectively detect unseen anomalies by training only on normal data. Although previous reconstruction-based methods have made fruitful progress, their generalization ability is limited due to two critical challenges. First, the training dataset only contains normal patterns, which limits the model's generalization ability. Second, the feature representations learned by existing models often lack representativeness, which hampers the ability to preserve the diversity of normal patterns. In this paper, we propose a novel approach called Adaptive Memory Network with Self-supervised Learning (AMSL) to address these challenges and enhance the generalization ability in unsupervised anomaly detection. Based on the convolutional autoencoder structure, AMSL incorporates a self-supervised learning module to learn general normal patterns and an adaptive memory fusion module to learn rich feature representations. Experiments on four public multivariate time series datasets demonstrate that AMSL significantly improves performance compared to other state-of-the-art methods. Specifically, on the largest CAP sleep stage detection dataset with 900 million samples, AMSL outperforms the second-best baseline by 4%+ in both accuracy and F1 score. Apart from the enhanced generalization ability, AMSL is also more robust against input noise.
    Exploiting Bi-directional Global Transition Patterns and Personal Preferences for Missing POI Category Identification. (arXiv:2201.00014v1 [cs.LG])
    (2 min) Recent years have witnessed the increasing popularity of Location-based Social Network (LBSN) services, which provide unparalleled opportunities to build personalized Point-of-Interest (POI) recommender systems. Existing POI recommendation and location prediction tasks utilize past information for future recommendation or prediction from a single-direction perspective, while the missing POI category identification task needs to utilize check-in information both before and after the missing category. Therefore, a long-standing challenge is how to effectively identify the missing POI categories at any time in the real-world check-in data of mobile users. To this end, in this paper, we propose a novel neural network approach to identify the missing POI categories by integrating both bi-directional global non-personal transition patterns and personal preferences of users. Specifically, we delicately design an attention matching cell to model how well the check-in category information matches users' non-personal transition patterns and personal preferences. Finally, we evaluate our model on two real-world datasets, which clearly validate its effectiveness compared with the state-of-the-art baselines. Furthermore, our model can be naturally extended to address next-POI category recommendation and prediction tasks with competitive performance.
    SAE: Sequential Anchored Ensembles. (arXiv:2201.00649v1 [cs.LG])
    (2 min) Computing the Bayesian posterior of a neural network is a challenging task due to the high dimensionality of the parameter space. Anchored ensembles approximate the posterior by training an ensemble of neural networks on anchored losses designed so that the optima follow the Bayesian posterior. Training an ensemble, however, becomes computationally expensive as its number of members grows, since the full training procedure is repeated for each member. In this note, we present Sequential Anchored Ensembles (SAE), a lightweight alternative to anchored ensembles. Instead of training each member of the ensemble from scratch, the members are trained sequentially on losses sampled with high auto-correlation, enabling fast convergence of the neural networks and efficient approximation of the Bayesian posterior. For a given computational budget, SAE outperforms anchored ensembles on some benchmarks, shows comparable performance on the others, and achieved 2nd and 3rd place in the light and extended tracks of the NeurIPS 2021 Approximate Inference in Bayesian Deep Learning competition.
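    For orientation, a hedged PyTorch sketch of a single anchored loss in the style of anchored ensembling (the regularization scaling below is assumed; SAE's contribution is to draw successive anchors with high auto-correlation instead of independently):

        import torch

        def anchored_loss(model, anchor_params, x, y, noise_var=0.01, prior_var=1.0):
            # Squared error plus a pull toward fixed anchor weights drawn from the
            # prior, so each member's optimum lands near a posterior sample.
            mse = torch.mean((model(x) - y) ** 2)
            reg = sum(((p - a) ** 2).sum()
                      for p, a in zip(model.parameters(), anchor_params))
            return mse + (noise_var / prior_var) * reg / len(x)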
    Matrix Decomposition and Applications. (arXiv:2201.00145v1 [math.NA])
    (2 min) In 1954, Alston S. Householder published Principles of Numerical Analysis, one of the first modern treatments of matrix decomposition, favoring the (block) LU decomposition: the factorization of a matrix into the product of lower and upper triangular matrices. Today, matrix decomposition has become a core technology in machine learning, largely due to the role of the backpropagation algorithm in fitting neural networks. The sole aim of this survey is to give a self-contained introduction to the concepts and mathematical tools of numerical linear algebra and matrix analysis, in order to seamlessly introduce matrix decomposition techniques and their applications in subsequent sections. However, we clearly cannot cover all the useful and interesting results concerning matrix decomposition within this limited scope; for example, we omit the separate analysis of Euclidean spaces, Hermitian spaces, Hilbert spaces, and treatments in the complex domain. We refer the reader to the linear algebra literature for a more detailed introduction to these related fields.
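    As a concrete taste of the survey's starting point, the (pivoted) LU decomposition is available off the shelf; a short sketch using SciPy:

        import numpy as np
        from scipy.linalg import lu

        A = np.array([[4., 3.], [6., 3.]])
        P, L, U = lu(A)   # A = P @ L @ U, with L unit lower- and U upper-triangular
        assert np.allclose(P @ L @ U, A)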
    Recover the spectrum of covariance matrix: a non-asymptotic iterative method. (arXiv:2201.00230v1 [stat.ML])
    (0 min) It is well known that the sample covariance has a consistent bias in its spectrum; for example, the spectrum of a Wishart matrix follows the Marchenko-Pastur law. In this work, we introduce an iterative algorithm, 'Concent', that actively eliminates this bias and recovers the true spectrum for small and moderate dimensions.
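    The bias being corrected is easy to reproduce; a small NumPy demonstration (illustrative only, not the 'Concent' algorithm itself): with true covariance I and aspect ratio p/n = 0.5, the sample eigenvalues spread over roughly [(1 - sqrt(0.5))^2, (1 + sqrt(0.5))^2], i.e. about [0.09, 2.91], instead of concentrating at 1.

        import numpy as np

        p, n = 200, 400
        X = np.random.default_rng(0).standard_normal((n, p))  # true covariance = I
        eigs = np.linalg.eigvalsh(X.T @ X / n)                 # sample covariance spectrum
        print(eigs.min(), eigs.max())   # spread over ~[0.09, 2.91] per Marchenko-Pastur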
    Artificial Intelligence and Statistical Techniques in Short-Term Load Forecasting: A Review. (arXiv:2201.00437v1 [cs.LG])
    (2 min) Electrical utilities depend on short-term demand forecasting to proactively adjust production and distribution in anticipation of major variations. This systematic review analyzes 240 works published in scholarly journals between 2000 and 2019 that focus on applying Artificial Intelligence (AI), statistical, and hybrid models to short-term load forecasting (STLF). This work represents the most comprehensive review of works on this subject to date. A complete analysis of the literature is conducted to identify the most popular and accurate techniques as well as existing gaps. The findings show that although Artificial Neural Networks (ANN) continue to be the most commonly used standalone technique, researchers have been increasingly opting for hybrid combinations of different techniques to leverage the combined advantages of individual methods. The review demonstrates that these hybrid combinations commonly achieve prediction accuracy exceeding 99%. The most successful short-term forecasting setup has been identified as prediction one day ahead at an hourly interval. The review has identified a deficiency in access to the datasets needed for training the models. A significant gap has also been identified in research covering regions other than Asia, Europe, North America, and Australia.
    Asymptotic Convergence of Deep Multi-Agent Actor-Critic Algorithms. (arXiv:2201.00570v1 [cs.LG])
    (2 min) We present sufficient conditions that ensure convergence of the multi-agent Deep Deterministic Policy Gradient (DDPG) algorithm. It is an example of one of the most popular paradigms of Deep Reinforcement Learning (DeepRL) for tackling continuous action spaces: the actor-critic paradigm. In the setting considered herein, each agent observes a part of the global state space in order to take local actions, for which it receives local rewards. For every agent, DDPG trains a local actor (policy) and a local critic (Q-function). The analysis shows that multi-agent DDPG using neural networks to approximate the local policies and critics converge to limits with the following properties: The critic limits minimize the average squared Bellman loss; the actor limits parameterize a policy that maximizes the local critic's approximation of $Q_i^*$, where $i$ is the agent index. The averaging is with respect to a probability distribution over the global state-action space. It captures the asymptotics of all local training processes. Finally, we extend the analysis to a fully decentralized setting where agents communicate over a wireless network prone to delays and losses; a typical scenario in, e.g., robotic applications.
    KerGNNs: Interpretable Graph Neural Networks with Graph Kernels. (arXiv:2201.00491v1 [cs.LG])
    (2 min) Graph kernels are historically the most widely used technique for graph classification tasks. However, these methods suffer from limited performance because of the hand-crafted combinatorial features of graphs. In recent years, graph neural networks (GNNs) have become the state-of-the-art method for downstream graph-related tasks due to their superior performance. Most GNNs are based on Message Passing Neural Network (MPNN) frameworks. However, recent studies show that MPNNs cannot exceed the power of the Weisfeiler-Lehman (WL) algorithm in the graph isomorphism test. To address the limitations of existing graph kernel and GNN methods, in this paper we propose a novel GNN framework, termed Kernel Graph Neural Networks (KerGNNs), which integrates graph kernels into the message passing process of GNNs. Inspired by convolution filters in convolutional neural networks (CNNs), KerGNNs adopt trainable hidden graphs as graph filters, which are combined with subgraphs to update node embeddings using graph kernels. In addition, we show that MPNNs can be viewed as special cases of KerGNNs. We apply KerGNNs to multiple graph-related tasks and use cross-validation to make fair comparisons with benchmarks. We show that our method achieves competitive performance compared with existing state-of-the-art methods, demonstrating the potential to increase the representation ability of GNNs. We also show that the trained graph filters in KerGNNs can reveal the local graph structures of the dataset, which significantly improves model interpretability compared with conventional GNN models.
    Avoiding Catastrophe: Active Dendrites Enable Multi-Task Learning in Dynamic Environments. (arXiv:2201.00042v1 [cs.NE])
    (2 min) A key challenge for AI is to build embodied systems that operate in dynamically changing environments. Such systems must adapt to changing task contexts and learn continuously. Although standard deep learning systems achieve state-of-the-art results on static benchmarks, they often struggle in dynamic scenarios. In these settings, error signals from multiple contexts can interfere with one another, ultimately leading to a phenomenon known as catastrophic forgetting. In this article we investigate biologically inspired architectures as solutions to these problems. Specifically, we show that the biophysical properties of dendrites and local inhibitory systems enable networks to dynamically restrict and route information in a context-specific manner. Our key contributions are as follows. First, we propose a novel artificial neural network architecture that incorporates active dendrites and sparse representations into the standard deep learning framework. Next, we study the performance of this architecture on two separate benchmarks requiring task-based adaptation: Meta-World, a multi-task reinforcement learning environment where a robotic agent must learn to solve a variety of manipulation tasks simultaneously; and a continual learning benchmark in which the model's prediction task changes throughout training. Analysis on both benchmarks demonstrates the emergence of overlapping but distinct and sparse subnetworks, allowing the system to fluidly learn multiple tasks with minimal forgetting. Our neural implementation marks the first time a single architecture has achieved competitive results in both multi-task and continual learning settings. Our research sheds light on how biological properties of neurons can inform deep learning systems to address dynamic scenarios that are typically impossible for traditional ANNs to solve.
    Role of Data Augmentation Strategies in Knowledge Distillation for Wearable Sensor Data. (arXiv:2201.00111v1 [cs.LG])
    (2 min) Deep neural networks are parametrized by several thousands or millions of parameters and have shown tremendous success in many classification problems. However, the large number of parameters makes it difficult to integrate these models into edge devices such as smartphones and wearables. To address this problem, knowledge distillation (KD) has been widely employed; it uses a pre-trained high-capacity network to train a much smaller network suitable for edge devices. In this paper, for the first time, we study the applicability and challenges of using KD for time-series data from wearable devices. Successful application of KD requires specific choices of data augmentation methods during training, yet it is not known whether there exists a coherent strategy for choosing an augmentation approach during KD. In this paper, we report the results of a detailed study that compares and contrasts various common choices and some hybrid data augmentation strategies in KD-based human activity analysis. Research in this area is often limited as there are not many comprehensive databases available in the public domain from wearable devices. Our study considers databases ranging from small-scale publicly available ones to one derived from a large-scale interventional study into human activity and sedentary behavior. We find that the choice of data augmentation techniques during KD has a variable level of impact on end performance, and that the optimal network choice as well as the data augmentation strategy are specific to the dataset at hand. However, we also conclude with a general set of recommendations that can provide a strong baseline performance across databases.
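    The KD objective referenced here is the standard Hinton-style loss; a hedged PyTorch sketch (temperature and mixing weight are illustrative):

        import torch
        import torch.nn.functional as F

        def kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.7):
            # KL between temperature-softened teacher and student distributions,
            # plus ordinary cross-entropy on the hard labels; the T*T factor keeps
            # soft-target gradients on a comparable scale as T varies.
            soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                            F.softmax(teacher_logits / T, dim=-1),
                            reduction="batchmean") * (T * T)
            hard = F.cross_entropy(student_logits, targets)
            return alpha * soft + (1 - alpha) * hard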
    Graph Signal Reconstruction Techniques for IoT Air Pollution Monitoring Platforms. (arXiv:2201.00378v1 [eess.SP])
    (2 min) Air pollution monitoring platforms play a very important role in preventing and mitigating the effects of pollution. Recent advances in the field of graph signal processing have made it possible to describe and analyze air pollution monitoring networks using graphs. One of the main applications is the reconstruction of the measured signal in a graph using a subset of sensors. Reconstructing the signal using information from sensor neighbors can help improve the quality of network data; examples include filling in missing data using correlated neighboring nodes, or correcting a drifting sensor using more accurate neighboring sensors. This paper compares various types of graph signal reconstruction methods applied to real datasets from Spanish air pollution reference stations. The methods considered are Laplacian interpolation, low-pass-based graph signal reconstruction from graph signal processing, and kernel-based graph signal reconstruction, compared on actual air pollution datasets measuring O3, NO2, and PM10. The ability of the methods to reconstruct the signal of a pollutant is shown, as well as the computational cost of this reconstruction. The results indicate the superiority of kernel-based graph signal reconstruction methods, as well as the difficulty all methods have scaling to an air pollution monitoring network with a large number of low-cost sensors. However, we show that scalability can be overcome with simple methods, such as partitioning the network using a clustering algorithm.
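    Laplacian interpolation, the simplest of the three compared methods, fits in a few lines: unknown node values are the harmonic extension of the known ones. A NumPy sketch (assuming a connected graph with weight matrix W, so the relevant submatrix is invertible):

        import numpy as np

        def laplacian_interpolate(W, x_known, known_idx):
            # Solve L_uu x_u = -L_uk x_k with L = D - W (combinatorial Laplacian).
            n = W.shape[0]
            L = np.diag(W.sum(axis=1)) - W
            unknown_idx = np.setdiff1d(np.arange(n), known_idx)
            x = np.zeros(n)
            x[known_idx] = x_known
            x[unknown_idx] = np.linalg.solve(L[np.ix_(unknown_idx, unknown_idx)],
                                             -L[np.ix_(unknown_idx, known_idx)] @ x_known)
            return x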
    An Efficient Federated Distillation Learning System for Multi-task Time Series Classification. (arXiv:2201.00011v1 [cs.LG])
    (2 min) This paper proposes an efficient federated distillation learning system (EFDLS) for multi-task time series classification (TSC). EFDLS consists of a central server and multiple mobile users, where different users may run different TSC tasks. EFDLS has two novel components, namely a feature-based student-teacher (FBST) framework and a distance-based weights matching (DBWM) scheme. Within each user, the FBST framework transfers knowledge from its teacher's hidden layers to its student's hidden layers via knowledge distillation, with the teacher and student having identical network structures. For each connected user, its student model's hidden-layer weights are uploaded to the EFDLS server periodically. The DBWM scheme is deployed on the server, with the least-squares distance used to measure the similarity between the weights of two given models. This scheme finds a partner for each connected user such that the user's and its partner's weights are the closest among all the weights uploaded. The server exchanges the user's and its partner's weights and sends them back to the two users, which then load the received weights into their teachers' hidden layers. Experimental results show that the proposed EFDLS achieves excellent performance on a set of selected UCR2018 datasets in terms of top-1 accuracy.
    Semi-Supervised Graph Attention Networks for Event Representation Learning. (arXiv:2201.00363v1 [cs.LG])
    (2 min) Event analysis from news and social networks is very useful for a wide range of social studies and real-world applications. Recently, event graphs have been explored to model event datasets and their complex relationships, where events are vertices connected to other vertices representing locations, people's names, dates, and various other event metadata. Graph representation learning methods are promising for extracting latent features from event graphs to enable the use of different classification algorithms. However, existing methods fail to meet essential requirements for event graphs, such as (i) dealing with semi-supervised graph embedding to take advantage of some labeled events, (ii) automatically determining the importance of the relationships between event vertices and their metadata vertices, as well as (iii) dealing with the graph heterogeneity. This paper presents GNEE (GAT Neural Event Embeddings), a method that combines Graph Attention Networks and Graph Regularization. First, an event graph regularization is proposed to ensure that all graph vertices receive event features, thereby mitigating the graph heterogeneity drawback. Second, semi-supervised graph embedding with self-attention mechanism considers existing labeled events, as well as learns the importance of relationships in the event graph during the representation learning process. A statistical analysis of experimental results with five real-world event graphs and six graph embedding methods shows that our GNEE outperforms state-of-the-art semi-supervised graph embedding methods.
    Augmentative eXplanation and the Distributional Gap of Confidence Optimization Score. (arXiv:2201.00009v1 [cs.LG])
    (2 min) This paper introduces the Confidence Optimization (CO) score to directly measure the contribution of heatmaps/saliency maps to the classification performance of a model. Common heatmap generation methods used in the eXplainable Artificial Intelligence (XAI) community are tested through a process we call Augmentative eXplanation (AX). We find a surprising gap in the distribution of CO scores across these heatmap methods. The gap potentially serves as a novel indicator of the correctness of deep neural network (DNN) predictions. We further introduce the Generative AX (GAX) method to generate saliency maps capable of attaining high CO scores. Using GAX, we also qualitatively demonstrate the unintuitiveness of DNN architectures.
    Validation and Transparency in AI systems for pharmacovigilance: a case study applied to the medical literature monitoring of adverse events. (arXiv:2201.00692v1 [cs.IR])
    (2 min) Recent advances in artificial intelligence applied to biomedical text are opening exciting opportunities for improving pharmacovigilance activities currently burdened by the ever-growing volumes of real-world data. To fully realize these opportunities, existing regulatory guidance and industry best practices should be taken into consideration in order to increase the overall trustworthiness of the system and enable broader adoption. In this paper we present a case study on how to operationalize existing guidance for validated AI systems in pharmacovigilance, focusing on the specific task of medical literature monitoring (MLM) of adverse events from the scientific literature. We describe an AI system designed with the goal of reducing effort in MLM activities, built in close collaboration with subject matter experts and considering guidance for validated systems in pharmacovigilance and AI transparency. In particular, we make use of public disclosures as a useful risk control measure to mitigate system misuse and earn user trust. In addition, we present experimental results showing the system can significantly reduce screening effort while maintaining high levels of recall (filtering out 55% of irrelevant articles on average, for a target recall of 0.99 on suspected adverse articles) and provide a robust method for tuning the desired recall to suit a particular risk profile.
    A Literature Review on Length of Stay Prediction for Stroke Patients using Machine Learning and Statistical Approaches. (arXiv:2201.00005v1 [cs.LG])
    (2 min) Hospital length of stay (LOS) is one of the most essential healthcare metrics that reflects the hospital quality of service and helps improve hospital scheduling and management. LOS prediction helps in cost management because patients who remain in hospitals usually do so in hospital units where resources are severely limited. In this study, we reviewed papers on LOS prediction using machine learning and statistical approaches. Our literature review considers research studies that focus on LOS prediction for stroke patients. Some of the surveyed studies revealed that authors reached contradicting conclusions. For example, the age of the patient was considered an important predictor of LOS for stroke patients in some studies, while other studies concluded that age was not a significant factor. Therefore, additional research is required in this domain to further understand the predictors of LOS for stroke patients.
    Performance Comparison of Deep Learning Architectures for Artifact Removal in Gastrointestinal Endoscopic Imaging. (arXiv:2201.00084v1 [eess.IV])
    (2 min) Endoscopic images typically contain several artifacts, which significantly impact image analysis results in computer-aided diagnosis. Convolutional neural networks (CNNs), a type of deep learning, can remove such artifacts. Various CNN architectures have been proposed, and the accuracy of artifact removal varies depending on the choice of architecture. Therefore, it is necessary to quantify how artifact-removal accuracy depends on the selected architecture. In this study, we focus on endoscopic surgical instruments as artifacts, and determine and discuss the artifact-removal accuracy of seven different CNN architectures.
    How do lexical semantics affect translation? An empirical study. (arXiv:2201.00075v1 [cs.CL])
    (2 min) Neural machine translation (NMT) systems aim to map text from one language into another. While there are a wide variety of applications of NMT, one of the most important is translation of natural language. A distinguishing factor of natural language is that words are typically ordered according to the rules of the grammar of a given language. Although many advances have been made in developing NMT systems for translating natural language, little research has been done on understanding how the word ordering of, and lexical similarity between, the source and target language affect translation performance. Here, we investigate these relationships on a variety of low-resource language pairs from the OpenSubtitles2016 database, where the source language is English, and find that the more similar the target language is to English, the greater the translation performance. In addition, we study the impact of providing NMT models with the part of speech (POS) of words in the English sequence and find that, for Transformer-based models, the more dissimilar the target language is from English, the greater the benefit provided by POS.
    Covariate-assisted Sparse Tensor Completion. (arXiv:2103.06428v2 [stat.ML] UPDATED)
    (2 min) We aim to provably complete a sparse and highly-missing tensor in the presence of covariate information along tensor modes. Our motivation comes from online advertising, where users' click-through rates (CTR) on ads over various devices form a CTR tensor that has about 96% missing entries and many zeros among the non-missing entries, which makes standalone tensor completion methods unsatisfactory. Besides the CTR tensor, additional ad features or user characteristics are often available. In this paper, we propose Covariate-assisted Sparse Tensor Completion (COSTCO) to incorporate covariate information for the recovery of the sparse tensor. The key idea is to jointly extract latent components from both the tensor and the covariate matrix to learn a synthetic representation. Theoretically, we derive the error bound for the recovered tensor components and explicitly quantify the improvements on both the reveal probability condition and the tensor recovery accuracy due to covariates. Finally, we apply COSTCO to an advertisement dataset consisting of a CTR tensor and an ad covariate matrix, leading to 23% accuracy improvement over the baseline. An important by-product is that ad latent components from COSTCO reveal interesting ad clusters, which are useful for better ad targeting.
    A General Descent Aggregation Framework for Gradient-based Bi-level Optimization. (arXiv:2102.07976v3 [cs.LG] UPDATED)
    (0 min) In recent years, a variety of gradient-based methods have been developed to solve Bi-Level Optimization (BLO) problems in machine learning and computer vision. However, the theoretical correctness and practical effectiveness of these existing approaches always rely on some restrictive conditions (e.g., Lower-Level Singleton, LLS), which can hardly be satisfied in real-world applications. Moreover, previous literature only proves theoretical results based on specific iteration strategies, and thus lacks a general recipe for uniformly analyzing the convergence behaviors of different gradient-based BLOs. In this work, we formulate BLOs from an optimistic bi-level viewpoint and establish a new gradient-based algorithmic framework, named Bi-level Descent Aggregation (BDA), to partially address the above issues. Specifically, BDA provides a modularized structure to hierarchically aggregate both the upper- and lower-level subproblems to generate our bi-level iterative dynamics. Theoretically, we establish a general convergence analysis template and derive a new proof recipe to investigate the essential theoretical properties of gradient-based BLO methods. Furthermore, this work systematically explores the convergence behavior of BDA in different optimization scenarios, i.e., considering various solution qualities (global/local/stationary solutions) returned from solving approximation subproblems. Extensive experiments justify our theoretical results and demonstrate the superiority of the proposed algorithm for hyper-parameter optimization and meta-learning tasks. Source code is available at https://github.com/vis-opt-group/BDA.
    Succinct Differentiation of Disparate Boosting Ensemble Learning Methods for Prognostication of Polycystic Ovary Syndrome Diagnosis. (arXiv:2201.00418v1 [cs.LG])
    (0 min) Prognostication of medical problems from clinical data using machine learning techniques with high precision is one of the most important real-world challenges at present. Polycystic Ovary Syndrome (PCOS) is an emerging disorder in women aged 15 to 49. In this paper, we diagnose this disorder using various boosting ensemble methods, and present a detailed, compendious comparison of AdaBoost, Gradient Boosting Machine, XGBoost, and CatBoost, together with their respective performance metrics, highlighting hidden anomalies in the data and their effects on the results. Metrics including the confusion matrix, precision, recall, F1 score, FPR, ROC curve, and AUC are used throughout.
    BARACK: Partially Supervised Group Robustness With Guarantees. (arXiv:2201.00072v1 [cs.LG])
    (2 min) While neural networks have shown remarkable success on classification tasks in terms of average-case performance, they often fail to perform well on certain groups of the data. Such group information may be expensive to obtain; thus, recent works in robustness and fairness have proposed ways to improve worst-group performance even when group labels are unavailable for the training data. However, these methods generally underperform methods that utilize group information at training time. In this work, we assume access to a small number of group labels alongside a larger dataset without group labels. We propose BARACK, a simple two-step framework that utilizes this partial group information to improve worst-group performance: first train a model to predict the missing group labels for the training data, and then use these predicted group labels in a robust optimization objective. Theoretically, we provide generalization bounds for our approach in terms of the worst-group performance, showing how the generalization error scales with respect to both the total number of training points and the number of training points with group labels. Empirically, our method outperforms baselines that do not use group information, even when only 1-33% of points have group labels. We provide ablation studies to support the robustness and extensibility of our framework.
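    A minimal sketch of the two-step idea, with illustrative names and one simplification: the paper plugs predicted group labels into a robust optimization objective, whereas the stand-in below merely converts them into inverse-frequency example weights.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def predicted_group_weights(X, groups, labeled_mask):
    """Step 1: fit a group-label predictor on the small labeled subset.
    Step 2: fill in missing group labels and derive per-example weights
    that upweight examples from rarer (likely worse-off) groups."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[labeled_mask], groups[labeled_mask])
    pred = clf.predict(X)                        # predicted group labels
    g, counts = np.unique(pred, return_counts=True)
    freq = dict(zip(g, counts / counts.sum()))
    return np.array([1.0 / freq[gi] for gi in pred])
```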
    Dynamic Least-Squares Regression. (arXiv:2201.00228v1 [cs.DS])
    (2 min) A common challenge in large-scale supervised learning is how to incorporate new incremental data into a pre-trained model without re-training the model from scratch. Motivated by this problem, we revisit the canonical problem of dynamic least-squares regression (LSR), where the goal is to learn a linear model over incremental training data. In this setup, data and labels $(\mathbf{A}^{(t)}, \mathbf{b}^{(t)}) \in \mathbb{R}^{t \times d}\times \mathbb{R}^t$ evolve in an online fashion ($t\gg d$), and the goal is to efficiently maintain an (approximate) solution to $\min_{\mathbf{x}^{(t)}} \| \mathbf{A}^{(t)} \mathbf{x}^{(t)} - \mathbf{b}^{(t)} \|_2$ for all $t\in [T]$. Our main result is a dynamic data structure which maintains an arbitrarily small constant approximate solution to dynamic LSR with amortized update time $O(d^{1+o(1)})$, almost matching the running time of the static (sketching-based) solution. By contrast, for exact (or even $1/\mathrm{poly}(n)$-accuracy) solutions, we show a separation between the static and dynamic settings, namely, that dynamic LSR requires $\Omega(d^{2-o(1)})$ amortized update time under the OMv Conjecture (Henzinger et al., STOC'15). Our data structure is conceptually simple, easy to implement, and fast both in theory and practice, as corroborated by experiments over both synthetic and real-world datasets.
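    For context, the exact regime the lower bound speaks to is the classical recursive least-squares update, which maintains the solution in $O(d^2)$ per arriving row via the Sherman-Morrison identity. The sketch below is this textbook baseline, not the paper's sketching-based data structure.

```python
import numpy as np

class RecursiveLSR:
    """Exact dynamic least squares: maintain argmin ||Ax - b||_2 under
    row arrivals by updating (A^T A + ridge*I)^{-1} with the
    Sherman-Morrison identity. Each update costs O(d^2)."""
    def __init__(self, d, ridge=1e-6):
        self.P = np.eye(d) / ridge   # inverse of the regularized Gram matrix
        self.x = np.zeros(d)

    def update(self, a, b):
        Pa = self.P @ a
        k = Pa / (1.0 + a @ Pa)          # gain vector
        self.x += k * (b - a @ self.x)   # correct the current solution
        self.P -= np.outer(k, Pa)        # rank-one downdate of the inverse
        return self.x
```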
    Evaluating Deep Music Generation Methods Using Data Augmentation. (arXiv:2201.00052v1 [cs.SD])
    (2 min) Despite advances in deep algorithmic music generation, evaluation of generated samples often relies on human evaluation, which is subjective and costly. We focus on designing a homogeneous, objective framework for evaluating samples of algorithmically generated music. Any engineered measures to evaluate generated music typically attempt to define the samples' musicality, but do not capture qualities of music such as theme or mood. We do not seek to assess the musical merit of generated music, but instead explore whether generated samples contain meaningful information pertaining to emotion or mood/theme. We achieve this by measuring the change in predictive performance of a music mood/theme classifier after augmenting its training data with generated samples. We analyse music samples generated by three models -- SampleRNN, Jukebox, and DDSP -- and employ a homogeneous framework across all methods to allow for objective comparison. This is the first attempt at augmenting a music genre classification dataset with conditionally generated music. We investigate the classification performance improvement using deep music generation and the ability of the generators to make emotional music by using an additional, emotion annotation of the dataset. Finally, we use a classifier trained on real data to evaluate the label validity of class-conditionally generated samples.
    Multi-view Subspace Adaptive Learning via Autoencoder and Attention. (arXiv:2201.00171v1 [cs.LG])
    (2 min) Multi-view learning covers the features of data samples more comprehensively, and has therefore attracted widespread attention. Traditional subspace clustering methods, such as sparse subspace clustering (SSC) and low-rank subspace clustering (LRSC), cluster the affinity matrix of a single view, thus ignoring the problem of fusion between views. In this article, we propose Multi-view Subspace Adaptive Learning based on Attention and Autoencoder (MSALAA). This method combines a deep autoencoder with a method for aligning the self-representations of the various views in Multi-view Low-Rank Sparse Subspace Clustering (MLRSSC), which not only increases the capacity for non-linear fitting, but also meets the consistency and complementarity principles of multi-view learning. We empirically observe significant improvements over existing baseline methods on six real-life datasets.
    Semantic Search for Large Scale Clinical Ontologies. (arXiv:2201.00118v1 [cs.CL])
    (2 min) Finding concepts in large clinical ontologies can be challenging when queries use different vocabularies. A search algorithm that overcomes this problem is useful in applications such as concept normalisation and ontology matching, where concepts can be referred to in different ways, using different synonyms. In this paper, we present a deep learning based approach to build a semantic search system for large clinical ontologies. We propose a Triplet-BERT model and a method that generates training data directly from the ontologies. The model is evaluated using five real benchmark data sets, and the results show that our approach achieves strong results on both free-text-to-concept and concept-to-concept search tasks, and outperforms all baseline methods.
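    The abstract does not spell out the Triplet-BERT architecture, but the triplet objective it implies is standard. A generic sketch, with random tensors standing in for encoder outputs:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Pull anchor embeddings toward a matching concept (positive) and
    away from a non-matching one (negative)."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# Random tensors stand in for encoder outputs (e.g. pooled BERT embeddings).
a, p, n = (torch.randn(8, 128) for _ in range(3))
print(triplet_loss(a, p, n))
```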
    Towards Robust Graph Neural Networks for Noisy Graphs with Sparse Labels. (arXiv:2201.00232v1 [cs.LG])
    (2 min) Graph Neural Networks (GNNs) have shown their great ability in modeling graph structured data. However, real-world graphs usually contain structure noises and have limited labeled nodes. The performance of GNNs would drop significantly when trained on such graphs, which hinders the adoption of GNNs on many applications. Thus, it is important to develop noise-resistant GNNs with limited labeled nodes. However, the work on this is rather limited. Therefore, we study a novel problem of developing robust GNNs on noisy graphs with limited labeled nodes. Our analysis shows that both the noisy edges and limited labeled nodes could harm the message-passing mechanism of GNNs. To mitigate these issues, we propose a novel framework which adopts the noisy edges as supervision to learn a denoised and dense graph, which can down-weight or eliminate noisy edges and facilitate message passing of GNNs to alleviate the issue of limited labeled nodes. The generated edges are further used to regularize the predictions of unlabeled nodes with label smoothness to better train GNNs. Experimental results on real-world datasets demonstrate the robustness of the proposed framework on noisy graphs with limited labeled nodes.
    Hybrid intelligence for dynamic job-shop scheduling with deep reinforcement learning and attention mechanism. (arXiv:2201.00548v1 [cs.AI])
    (2 min) The dynamic job-shop scheduling problem (DJSP) is a class of scheduling tasks that specifically consider the inherent uncertainties such as changing order requirements and possible machine breakdown in realistic smart manufacturing settings. Since traditional methods cannot dynamically generate effective scheduling strategies in face of the disturbance of environments, we formulate the DJSP as a Markov decision process (MDP) to be tackled by reinforcement learning (RL). For this purpose, we propose a flexible hybrid framework that takes disjunctive graphs as states and a set of general dispatching rules as the action space with minimum prior domain knowledge. The attention mechanism is used as the graph representation learning (GRL) module for the feature extraction of states, and the double dueling deep Q-network with prioritized replay and noisy networks (D3QPN) is employed to map each state to the most appropriate dispatching rule. Furthermore, we present Gymjsp, a public benchmark based on the well-known OR-Library, to provide a standardized off-the-shelf facility for RL and DJSP research communities. Comprehensive experiments on various DJSP instances confirm that our proposed framework is superior to baseline algorithms with smaller makespan across all instances and provide empirical justification for the validity of the various components in the hybrid framework.
    Knowledge intensive state design for traffic signal control. (arXiv:2201.00006v1 [cs.LG])
    (2 min) There is a general trend of applying reinforcement learning (RL) techniques to traffic signal control (TSC). Most recent studies pay attention to neural network design and rarely concentrate on the state representation. Does the design of the state representation have an impact on TSC? In this paper, we (1) propose an effective, knowledge-intensive state representation based on the queue length of vehicles; (2) present a TSC method called MaxQueue based on our state representation approach (see the sketch below); (3) develop a general RL-based TSC template called QL-XLight with queue length as both state and reward, and generate QL-FRAP, QL-CoLight, and QL-DQN from the QL-XLight template based on traditional and recent RL models. Through comprehensive experiments on multiple real-world datasets, we demonstrate that: (1) our MaxQueue method outperforms the latest RL-based methods; (2) QL-FRAP and QL-CoLight achieve a new state-of-the-art (SOTA). In general, a state representation with intensive knowledge is also essential for TSC methods. Our code is released on GitHub.
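    Given queue lengths as the state, the MaxQueue rule can be sketched in a few lines (an illustrative reading of the abstract, not the released code): activate the phase whose lanes hold the most queued vehicles.

```python
def max_queue_action(queue_lengths):
    """Pick the signal phase whose incoming lanes currently hold the
    largest number of queued vehicles."""
    return max(range(len(queue_lengths)), key=lambda p: queue_lengths[p])

# Four phases queuing 3, 12, 7, and 5 vehicles -> activate phase 1.
assert max_queue_action([3, 12, 7, 5]) == 1
```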
    ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions. (arXiv:2201.00382v1 [cs.LG])
    (2 min) Outlier detection refers to the identification of data points that deviate from a general data distribution. Existing unsupervised approaches often suffer from high computational cost, complex hyperparameter tuning, and limited interpretability, especially when working with large, high-dimensional datasets. To address these issues, we present a simple yet effective algorithm called ECOD (Empirical-Cumulative-distribution-based Outlier Detection), which is inspired by the fact that outliers are often the "rare events" that appear in the tails of a distribution. In a nutshell, ECOD first estimates the underlying distribution of the input data in a nonparametric fashion by computing the empirical cumulative distribution per dimension of the data. ECOD then uses these empirical distributions to estimate tail probabilities per dimension for each data point. Finally, ECOD computes an outlier score of each data point by aggregating estimated tail probabilities across dimensions. Our contributions are as follows: (1) we propose a novel outlier detection method called ECOD, which is both parameter-free and easy to interpret; (2) we perform extensive experiments on 30 benchmark datasets, where we find that ECOD outperforms 11 state-of-the-art baselines in terms of accuracy, efficiency, and scalability; and (3) we release an easy-to-use and scalable (with distributed support) Python implementation for accessibility and reproducibility.
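    The abstract describes the algorithm completely enough for a compact sketch. The version below uses two-sided tail probabilities and a negative-log aggregation; the released implementation may differ in details such as skewness handling.

```python
import numpy as np

def ecod_scores(X):
    """Minimal ECOD sketch: per-dimension empirical CDF -> tail
    probabilities -> aggregated outlier score (higher = more outlying)."""
    n, _ = X.shape
    ranks = X.argsort(axis=0).argsort(axis=0) + 1   # 1..n within each dim
    cdf = ranks / (n + 1)                           # ECDF, away from 0 and 1
    tail = np.minimum(cdf, 1.0 - cdf)               # two-sided tail prob
    return -np.log(tail).sum(axis=1)                # aggregate across dims

X = np.vstack([np.random.randn(200, 3), [[6.0, 6.0, 6.0]]])
print(ecod_scores(X).argmax())   # almost surely 200, the planted outlier
```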
    Thinking inside the box: A tutorial on grey-box Bayesian optimization. (arXiv:2201.00272v1 [cs.LG])
    (2 min) Bayesian optimization (BO) is a framework for global optimization of expensive-to-evaluate objective functions. Classical BO methods assume that the objective function is a black box. However, internal information about objective function computation is often available. For example, when optimizing a manufacturing line's throughput with simulation, we observe the number of parts waiting at each workstation, in addition to the overall throughput. Recent BO methods leverage such internal information to dramatically improve performance. We call these "grey-box" BO methods because they treat objective computation as partially observable and even modifiable, blending the black-box approach with so-called "white-box" first-principles knowledge of objective function computation. This tutorial describes these methods, focusing on BO of composite objective functions, where one can observe and selectively evaluate individual constituents that feed into the overall objective; and multi-fidelity BO, where one can evaluate cheaper approximations of the objective function by varying parameters of the evaluation oracle.
    Transfer-learning-based Surrogate Model for Thermal Conductivity of Nanofluids. (arXiv:2201.00435v1 [physics.flu-dyn])
    (2 min) Heat transfer characteristics of nanofluids have been extensively studied since the 1990s. Research investigations show that the suspended nanoparticles significantly alter the suspension's thermal properties. The thermal conductivity of nanofluids is one of the properties that is generally found to be greater than that of the base fluid. This increase in thermal conductivity is found to depend on several parameters. Several theories have been proposed to model the thermal conductivities of nanofluids, but there is no reliable universal theory yet to model their anomalous thermal conductivity. In recent years, supervised data-driven methods have been successfully employed to create surrogate models across various scientific disciplines, especially for modeling difficult-to-understand phenomena, as these supervised learning methods allow the models to capture highly non-linear phenomena. In this work, we have taken advantage of existing correlations and used them concurrently with available experimental results to develop more robust surrogate models for predicting the thermal conductivity of nanofluids. Artificial neural networks are trained using the transfer learning approach to predict the thermal conductivity enhancement of nanofluids with spherical particles for 32 different particle-fluid combinations (8 particle materials and 4 fluids). The large amount of lower-accuracy data generated from correlations is used to coarse-tune the model parameters, and the limited amount of more trustworthy experimental data is used to fine-tune the model parameters. The transfer-learning-based models' results are compared with those from baseline models trained only on experimental data using a goodness-of-fit metric. The transfer learning models perform better, with goodness-of-fit values of 0.93 as opposed to 0.83 from the baseline models.
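    The coarse-tune/fine-tune loop is easy to express. A hedged PyTorch sketch with illustrative shapes and learning rates; the random tensors stand in for correlation-generated and experimental data.

```python
import torch
import torch.nn as nn

def make_surrogate():
    # Small MLP from nanofluid descriptors (4 inputs chosen for
    # illustration) to thermal-conductivity enhancement.
    return nn.Sequential(nn.Linear(4, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, 1))

def fit(model, X, y, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    return model

model = make_surrogate()
X_corr, y_corr = torch.randn(5000, 4), torch.randn(5000, 1)  # correlation data
X_exp, y_exp = torch.randn(100, 4), torch.randn(100, 1)      # experiments
fit(model, X_corr, y_corr, epochs=50, lr=1e-3)   # coarse-tune on cheap data
fit(model, X_exp, y_exp, epochs=200, lr=1e-4)    # fine-tune on trusted data
```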
    TransLog: A Unified Transformer-based Framework for Log Anomaly Detection. (arXiv:2201.00016v1 [cs.LG])
    (2 min) Log anomaly detection is a key component in the field of artificial intelligence for IT operations (AIOps). Considering log data from varied domains, retraining the whole network for unknown domains is inefficient in real industrial scenarios, especially for low-resource domains. However, previous deep models merely focused on extracting the semantics of log sequences within the same domain, leading to poor generalization on multi-domain logs. Therefore, we propose TransLog, a unified Transformer-based framework for log anomaly detection, which comprises a pretraining stage and an adapter-based tuning stage. Our model is first pretrained on the source domain to obtain shared semantic knowledge of log data. Then, we transfer the pretrained model to the target domain via adapter-based tuning. The proposed method is evaluated on three public datasets including one source domain and two target domains. The experimental results demonstrate that our simple yet efficient approach, with fewer trainable parameters and lower training costs in the target domain, achieves state-of-the-art performance on three benchmarks.
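    Adapter-based tuning as described (freeze the pretrained body, train small bottleneck modules per target domain) can be sketched generically; the module shape and naming convention below are assumptions, not TransLog's code.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: a small residual MLP inserted into a frozen
    pretrained Transformer; only these parameters train per domain."""
    def __init__(self, dim, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual connection

def freeze_all_but_adapters(model):
    # Assumes adapters were attached as submodules named "adapter",
    # e.g. layer.adapter = Adapter(hidden_dim).
    for name, p in model.named_parameters():
        p.requires_grad = "adapter" in name
```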
    Overview of the EEG Pilot Subtask at MediaEval 2021: Predicting Media Memorability. (arXiv:2201.00620v1 [q-bio.NC])
    (2 min) The aim of the Memorability-EEG pilot subtask at MediaEval'2021 is to promote interest in the use of neural signals -- either alone or in combination with other data sources -- in the context of predicting video memorability by highlighting the utility of EEG data. The dataset created consists of pre-extracted features from EEG recordings of subjects while watching a subset of videos from Predicting Media Memorability subtask 1. This demonstration pilot gives interested researchers a sense of how neural signals can be used without any prior domain knowledge, and enables them to do so in a future memorability task. The dataset can be used to support the exploration of novel machine learning and processing strategies for predicting video memorability, while potentially increasing interdisciplinary interest in the subject of memorability, and opening the door to new combined EEG-computer vision approaches.
    An analysis of over-sampling labeled data in semi-supervised learning with FixMatch. (arXiv:2201.00604v1 [cs.LG])
    (2 min) Most semi-supervised learning methods over-sample labeled data when constructing training mini-batches. This paper studies whether this common practice improves learning and how. We compare it to an alternative setting where each mini-batch is uniformly sampled from all the training data, labeled or not, which greatly reduces direct supervision from true labels in typical low-label regimes. However, this simpler setting can also be seen as more general and even necessary in multi-task problems where over-sampling labeled data would become intractable. Our experiments on semi-supervised CIFAR-10 image classification using FixMatch show a performance drop when using the uniform sampling approach which diminishes when the amount of labeled data or the training time increases. Further, we analyse the training dynamics to understand how over-sampling of labeled data compares to uniform sampling. Our main finding is that over-sampling is especially beneficial early in training but gets less important in the later stages when more pseudo-labels become correct. Nevertheless, we also find that keeping some true labels remains important to avoid the accumulation of confirmation errors from incorrect pseudo-labels.
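    The two batch-construction schemes being compared reduce to a few lines. An index-level sketch with illustrative names:

```python
import numpy as np

def oversampled_batch(labeled_idx, unlabeled_idx, n_lab, n_unlab, rng):
    """FixMatch-style batch: a fixed quota of labeled examples per batch,
    regardless of how scarce labels are in the full dataset."""
    return (rng.choice(labeled_idx, n_lab),
            rng.choice(unlabeled_idx, n_unlab))

def uniform_batch(all_idx, batch_size, rng):
    """Alternative studied in the paper: sample uniformly from all data,
    so labeled examples appear only in proportion to their share."""
    return rng.choice(all_idx, batch_size)
```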
    The GatedTabTransformer. An enhanced deep learning architecture for tabular modeling. (arXiv:2201.00199v1 [cs.LG])
    (2 min) There is increasing interest in applying deep learning architectures to tabular data. One of the state-of-the-art solutions is TabTransformer, which incorporates an attention mechanism to better track relationships between categorical features and then makes use of a standard MLP to output its final logits. In this paper we propose multiple modifications to the original TabTransformer that perform better on binary classification tasks across three separate datasets, with more than 1% AUROC gains. Inspired by gated MLP, linear projections are implemented in the MLP block and multiple activation functions are tested. We also evaluate the importance of specific hyperparameters during training.
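    One plausible reading of the gated-MLP-inspired modification is a multiplicative linear gate inside the MLP block; the block below is that guess, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class GatedMLPBlock(nn.Module):
    """MLP block with a multiplicative linear gate, in the spirit of
    gMLP: one branch is gated element-wise by a linear projection of
    the input before the final output projection."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.proj_in = nn.Linear(dim, hidden)
        self.gate = nn.Linear(dim, hidden)
        self.proj_out = nn.Linear(hidden, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return self.proj_out(self.act(self.proj_in(x)) * self.gate(x))

x = torch.randn(32, 16)
print(GatedMLPBlock(16, 64)(x).shape)   # torch.Size([32, 16])
```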
    Optimal Representations for Covariate Shift. (arXiv:2201.00057v1 [cs.LG])
    (2 min) Machine learning systems often experience a distribution shift between training and testing. In this paper, we introduce a simple variational objective whose optima are exactly the set of all representations on which risk minimizers are guaranteed to be robust to any distribution shift that preserves the Bayes predictor, e.g., covariate shifts. Our objective has two components. First, a representation must remain discriminative for the task, i.e., some predictor must be able to simultaneously minimize the source and target risk. Second, the representation's marginal support needs to be the same across source and target. We make this practical by designing self-supervised learning methods that only use unlabelled data and augmentations to train robust representations. Our objectives achieve state-of-the-art results on DomainBed, and give insights into the robustness of recent methods, such as CLIP.
    Revisiting PGD Attacks for Stability Analysis of Large-Scale Nonlinear Systems and Perception-Based Control. (arXiv:2201.00801v1 [math.OC])
    (2 min) Many existing region-of-attraction (ROA) analysis tools find difficulty in addressing feedback systems with large-scale neural network (NN) policies and/or high-dimensional sensing modalities such as cameras. In this paper, we tailor the projected gradient descent (PGD) attack method developed in the adversarial learning community as a general-purpose ROA analysis tool for large-scale nonlinear systems and end-to-end perception-based control. We show that the ROA analysis can be approximated as a constrained maximization problem whose goal is to find the worst-case initial condition which shifts the terminal state the most. We then present two PGD-based iterative methods which can be used to solve the resultant constrained maximization problem. Our analysis is not based on Lyapunov theory, and hence requires minimal information about the problem structure. In the model-based setting, we show that the PGD updates can be efficiently performed using back-propagation. In the model-free setting (which is more relevant to ROA analysis of perception-based control), we propose a finite-difference PGD estimate which is general and only requires a black-box simulator for generating the trajectories of the closed-loop system given any initial state. We demonstrate the scalability and generality of our analysis tool on several numerical examples with large-scale NN policies and high-dimensional image observations. We believe that our proposed analysis serves as a meaningful initial step toward further understanding of closed-loop stability of large-scale nonlinear systems and perception-based control.
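    The model-free variant lends itself to a short sketch. Below, `simulate` is a placeholder for the black-box closed-loop rollout returning the terminal state, the equilibrium is assumed to be the origin, and the ascent maximizes the terminal-state deviation over a ball of initial conditions.

```python
import numpy as np

def fd_pgd(simulate, x0, radius, steps=50, lr=0.1, eps=1e-3):
    """Search for a worst-case initial state within a ball around x0 by
    projected gradient ascent, estimating gradients of the terminal-state
    deviation with central finite differences (black-box access only)."""
    x = x0.copy()
    obj = lambda z: np.linalg.norm(simulate(z))   # terminal-state shift
    for _ in range(steps):
        g = np.zeros_like(x)
        for i in range(len(x)):                   # per-coordinate FD estimate
            e = np.zeros_like(x)
            e[i] = eps
            g[i] = (obj(x + e) - obj(x - e)) / (2 * eps)
        x = x + lr * g                            # ascend toward worst case
        d = x - x0                                # project back onto the ball
        if np.linalg.norm(d) > radius:
            x = x0 + radius * d / np.linalg.norm(d)
    return x
```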
    Stochastic convex optimization for provably efficient apprenticeship learning. (arXiv:2201.00039v1 [cs.LG])
    (2 min) We consider large-scale Markov decision processes (MDPs) with an unknown cost function and employ stochastic convex optimization tools to address the problem of imitation learning, which consists of learning a policy from a finite set of expert demonstrations. We adopt the apprenticeship learning formalism, which carries the assumption that the true cost function can be represented as a linear combination of some known features. Existing inverse reinforcement learning algorithms come with strong theoretical guarantees, but are computationally expensive because they use reinforcement learning or planning algorithms as a subroutine. On the other hand, state-of-the-art policy gradient based algorithms (like IM-REINFORCE, IM-TRPO, and GAIL), achieve significant empirical success in challenging benchmark tasks, but are not well understood in terms of theory. With an emphasis on non-asymptotic guarantees of performance, we propose a method that directly learns a policy from expert demonstrations, bypassing the intermediate step of learning the cost function, by formulating the problem as a single convex optimization problem over occupancy measures. We develop a computationally efficient algorithm and derive high confidence regret bounds on the quality of the extracted policy, utilizing results from stochastic convex optimization and recent works in approximate linear programming for solving forward MDPs.
    Deep Nonparametric Estimation of Operators between Infinite Dimensional Spaces. (arXiv:2201.00217v1 [stat.ML])
    (2 min) Learning operators between infinite-dimensional spaces is an important learning task arising in a wide range of applications in machine learning, imaging science, mathematical modeling and simulation, etc. This paper studies the nonparametric estimation of Lipschitz operators using deep neural networks. Non-asymptotic upper bounds are derived for the generalization error of the empirical risk minimizer over a properly chosen network class. Under the assumption that the target operator exhibits a low-dimensional structure, our error bounds decay as the training sample size increases, with an attractive fast rate depending on the intrinsic dimension in our estimation. Our assumptions cover most scenarios in real applications, and our results give rise to fast rates by exploiting low-dimensional structures of data in operator estimation. We also investigate the influence of network structures (e.g., network width, depth, and sparsity) on the generalization error of the neural network estimator, and propose a general quantitative suggestion on the choice of network structures to maximize learning efficiency.
    Experiment Based Crafting and Analyzing of Machine Learning Solutions. (arXiv:2201.00355v1 [cs.LG])
    (2 min) The crafting of machine learning (ML) based systems requires statistical control throughout its life cycle. Careful quantification of business requirements and identification of key factors that impact the business requirements reduces the risk of a project failure. The quantification of business requirements results in the definition of random variables representing the system key performance indicators that need to be analyzed through statistical experiments. In addition, available data for training and experiment results impact the design of the system. Once the system is developed, it is tested and continually monitored to ensure it meets its business requirements. This is done through the continued application of statistical experiments to analyze and control the key performance indicators. This book teaches the art of crafting and developing ML based systems. It advocates an "experiment first" approach stressing the need to define statistical experiments from the beginning of the project life cycle. It also discusses in detail how to apply statistical control on the ML based system throughout its lifecycle.
    Croesus: Multi-Stage Processing and Transactions for Video-Analytics in Edge-Cloud Systems. (arXiv:2201.00063v1 [eess.SY])
    (2 min) Emerging edge applications require both a fast response latency and complex processing. This is infeasible without expensive hardware that can process complex operations -- such as object detection -- within a short time. Many approach this problem by addressing the complexity of the models -- via model compression, pruning and quantization -- or compressing the input. In this paper, we propose a different perspective when addressing the performance challenges. Croesus is a multi-stage approach to edge-cloud systems that provides the ability to find the balance between accuracy and performance. Croesus consists of two stages (that can be generalized to multiple stages): an initial and a final stage. The initial stage performs the computation in real-time using approximate/best-effort computation at the edge. The final stage performs the full computation at the cloud, and uses the results to correct any errors made at the initial stage. In this paper, we demonstrate the implications of such an approach on a video analytics use-case and show how multi-stage processing yields a better balance between accuracy and performance. Moreover, we study the safety of multi-stage transactions via two proposals: multi-stage serializability (MS-SR) and multi-stage invariant confluence with Apologies (MS-IA).
    Actor-Critic Network for Q&A in an Adversarial Environment. (arXiv:2201.00455v1 [cs.CL])
    (2 min) Significant work has been done in the Q&A NLP space to build models that are more robust to adversarial attacks. Two key areas of focus are generating adversarial data for training against these situations, and modifying existing architectures to build in robustness. This paper introduces an approach that joins these two ideas to train a critic model for use in a framework resembling reinforcement learning. Using the Adversarial SQuAD "Add One Sent" dataset, we show promising signs that this method protects against adversarial attacks.
    DiffuseVAE: Efficient, Controllable and High-Fidelity Generation from Low-Dimensional Latents. (arXiv:2201.00308v1 [cs.LG])
    (2 min) Diffusion probabilistic models have been shown to generate state-of-the-art results on several competitive image synthesis benchmarks, but they lack a low-dimensional, interpretable latent space and are slow at generation. On the other hand, Variational Autoencoders (VAEs) typically have access to a low-dimensional latent space but exhibit poor sample quality. Despite recent advances, VAEs usually require high-dimensional hierarchies of latent codes to generate high-quality samples. We present DiffuseVAE, a novel generative framework that integrates a VAE within a diffusion model framework, and leverages this to design a novel conditional parameterization for diffusion models. We show that the resulting model can improve upon the unconditional diffusion model in terms of sampling efficiency while also equipping diffusion models with the low-dimensional VAE-inferred latent code. Furthermore, we show that the proposed model can generate high-resolution samples and exhibits synthesis quality comparable to state-of-the-art models on standard benchmarks. Lastly, we show that the proposed method can be used for controllable image synthesis and also exhibits out-of-the-box capabilities for downstream tasks like image super-resolution and denoising. For reproducibility, our source code is publicly available at https://github.com/kpandey008/DiffuseVAE.
  • stat.ML updates on arXiv.org

    A Near-Optimal Algorithm for Debiasing Trained Machine Learning Models. (arXiv:2106.12887v2 [cs.LG] UPDATED)
    (2 min) We present a scalable post-processing algorithm for debiasing trained models, including deep neural networks (DNNs), which we prove to be near-optimal by bounding its excess Bayes risk. We empirically validate its advantages on standard benchmark datasets across both classical algorithms as well as modern DNN architectures and demonstrate that it outperforms previous post-processing methods while performing on par with in-processing. In addition, we show that the proposed algorithm is particularly effective for models trained at scale where post-processing is a natural and practical choice.
    Scalable semi-supervised dimensionality reduction with GPU-accelerated EmbedSOM. (arXiv:2201.00701v1 [cs.LG])
    (2 min) Dimensionality reduction methods have found vast application as visualization tools in diverse areas of science. Although many different methods exist, their performance is often insufficient for providing quick insight into many contemporary datasets, and the unsupervised mode of use prevents the users from utilizing the methods for dataset exploration and fine-tuning the details for improved visualization quality. We present BlosSOM, a high-performance semi-supervised dimensionality reduction software for interactive user-steerable visualization of high-dimensional datasets with millions of individual data points. BlosSOM builds on a GPU-accelerated implementation of the EmbedSOM algorithm, complemented by several landmark-based algorithms for interfacing the unsupervised model learning algorithms with the user supervision. We show the application of BlosSOM on realistic datasets, where it helps to produce high-quality visualizations that incorporate user-specified layout and focus on certain features. We believe the semi-supervised dimensionality reduction will improve the data visualization possibilities for science areas such as single-cell cytometry, and provide a fast and efficient base methodology for new directions in dataset exploration and annotation.
    Optimal Sampling Gaps for Adaptive Submodular Maximization. (arXiv:2104.01750v4 [cs.LG] UPDATED)
    (2 min) Running machine learning algorithms on large and rapidly growing volumes of data is often computationally expensive. One common trick to reduce the size of a data set, and thus the computational cost of machine learning algorithms, is probability sampling: it creates a sampled data set by including each data point from the original data set with a known probability. Although the benefit of running machine learning algorithms on the reduced data set is obvious, one major concern is that the performance of the solution obtained from samples might be much worse than that of the optimal solution when using the full data set. In this paper, we examine the performance loss caused by probability sampling in the context of adaptive submodular maximization. We consider a simple probability sampling method which selects each data point with probability $r\in[0,1]$. If we set the sampling rate $r=1$, our problem reduces to finding a solution based on the original full data set. We define the sampling gap as the largest ratio between the optimal solution obtained from the full data set and the optimal solution obtained from the samples, over independence systems; it captures the performance loss of the optimal solution caused by probability sampling. Our main contribution is to show that if the utility function is policywise submodular, then for a given sampling rate $r$, the sampling gap is both upper bounded and lower bounded by $1/r$. One immediate implication of our result is that if we can find an $\alpha$-approximation solution based on a sampled data set (which is sampled at sampling rate $r$), then this solution achieves an $\alpha r$ approximation ratio against the optimal solution when using the full data set.
    On the Provable Generalization of Recurrent Neural Networks. (arXiv:2109.14142v3 [cs.LG] UPDATED)
    (2 min) The Recurrent Neural Network (RNN) is a fundamental structure in deep learning. Recently, some works have studied the training process of over-parameterized neural networks and shown that over-parameterized networks can learn functions in some notable concept classes with a provable generalization error bound. In this paper, we analyze the training and generalization of RNNs with random initialization, and provide the following improvements over recent works: 1) For an RNN with input sequence $x=(X_1,X_2,...,X_L)$, previous works learn functions that are summations of $f(\beta^T_lX_l)$ and require normalization conditions $||X_l||\leq\epsilon$ with some very small $\epsilon$ depending on the complexity of $f$. In this paper, using a detailed analysis of the neural tangent kernel matrix, we prove a generalization error bound for learning such functions without normalization conditions, and show that some notable concept classes are learnable with the numbers of iterations and samples scaling almost-polynomially in the input length $L$. 2) Moreover, we prove a novel result on learning $N$-variable functions of the input sequence of the form $f(\beta^T[X_{l_1},...,X_{l_N}])$, which do not belong to the "additive" concept class, i.e., the summation of functions $f(X_l)$. We show that when either $N$ or $l_0=\max(l_1,..,l_N)-\min(l_1,..,l_N)$ is small, $f(\beta^T[X_{l_1},...,X_{l_N}])$ is learnable with the numbers of iterations and samples scaling almost-polynomially in the input length $L$.
    Online Regularization towards Always-Valid High-Dimensional Dynamic Pricing. (arXiv:2007.02470v2 [stat.ML] UPDATED)
    (2 min) Devising a dynamic pricing policy with an always-valid online statistical learning procedure is an important and as yet unresolved problem. Most existing dynamic pricing policies, which focus on the faithfulness of adopted customer choice models, exhibit limited capability to adapt to the online uncertainty of the learned statistical model during the pricing process. In this paper, we propose a novel approach for designing dynamic pricing policies based on regularized online statistical learning with theoretical guarantees. The new approach overcomes the challenge of continuous monitoring of the online Lasso procedure and possesses several appealing properties. In particular, we make the decisive observation that the always-validity of pricing decisions builds and thrives on the online regularization scheme. Our proposed online regularization scheme equips the optimistic online regularized maximum likelihood pricing (OORMLP) policy with three major advantages: it encodes market noise knowledge into pricing-process optimism; it empowers online statistical learning with always-validity over all decision points; and it envelops the prediction error process with time-uniform non-asymptotic oracle inequalities. This type of non-asymptotic inference result allows us to design more sample-efficient and robust dynamic pricing algorithms in practice. In theory, the proposed OORMLP algorithm exploits the sparsity structure of high-dimensional models and secures a logarithmic regret in a decision horizon. These theoretical advances are made possible by an optimistic online Lasso procedure that resolves dynamic pricing problems at the process level, based on a novel use of non-asymptotic martingale concentration. In experiments, we evaluate OORMLP in different synthetic and real pricing problem settings and demonstrate that OORMLP advances the state-of-the-art methods.
    Conditional Selective Inference for Robust Regression and Outlier Detection using Piecewise-Linear Homotopy Continuation. (arXiv:2104.10840v2 [stat.ML] UPDATED)
    (2 min) In practical data analysis in noisy environments, it is common to first use robust methods to identify outliers, and then to conduct further analysis after removing the outliers. In this paper, we consider statistical inference on the model estimated after outliers are removed, which can be interpreted as a selective inference (SI) problem. To use the conditional SI framework, it is necessary to characterize the events by which the robust method identifies outliers. Unfortunately, the existing methods cannot be directly used here because they are applicable only where the selection events can be represented by linear/quadratic constraints. In this paper, we propose a conditional SI method for popular robust regressions using a homotopy method. We show that the proposed conditional SI method is applicable to a wide class of robust regression and outlier detection methods, and has good empirical performance in both synthetic-data and real-data experiments.
    Application of Machine Learning Methods in Inferring Surface Water Groundwater Exchanges using High Temporal Resolution Temperature Measurements. (arXiv:2201.00726v1 [cs.LG])
    (2 min) We examine the ability of machine learning (ML) and deep learning (DL) algorithms to infer surface/ground exchange flux based on subsurface temperature observations. The observations and fluxes are produced from a high-resolution numerical model representing conditions in the Columbia River near the Department of Energy Hanford site located in southeastern Washington State. Random measurement error, of varying magnitude, is added to the synthetic temperature observations. The results indicate that both ML and DL methods can be used to infer the surface/ground exchange flux. DL methods, especially convolutional neural networks, outperform the ML methods when used to interpret noisy temperature data with a smoothing filter applied. However, the ML methods also performed well, and they can better identify a reduced number of important observations, which could be useful for measurement network optimization. Surprisingly, the ML and DL methods better inferred upward flux than downward flux. This is in direct contrast to previous findings using numerical models to infer flux from temperature observations, and it may suggest that combined use of ML or DL inference with numerical inference could improve flux estimation beneath river systems.
    Provable Meta-Learning of Linear Representations. (arXiv:2002.11684v5 [cs.LG] UPDATED)
    (2 min) Meta-learning, or learning-to-learn, seeks to design algorithms that can utilize previous experience to rapidly learn new skills or adapt to new environments. Representation learning -- a key tool for performing meta-learning -- learns a data representation that can transfer knowledge across multiple tasks, which is essential in regimes where data is scarce. Despite a recent surge of interest in the practice of meta-learning, the theoretical underpinnings of meta-learning algorithms are lacking, especially in the context of learning transferable representations. In this paper, we focus on the problem of multi-task linear regression -- in which multiple linear regression models share a common, low-dimensional linear representation. Here, we provide provably fast, sample-efficient algorithms to address the dual challenges of (1) learning a common set of features from multiple, related tasks, and (2) transferring this knowledge to new, unseen tasks. Both are central to the general problem of meta-learning. Finally, we complement these results by providing information-theoretic lower bounds on the sample complexity of learning these linear features.
    Statistical Inference with Local Optima. (arXiv:1807.04431v2 [math.ST] UPDATED)
    (2 min) We study the statistical properties of an estimator derived by applying a gradient ascent method with multiple initializations to a multi-modal likelihood function. We derive the population quantity that is the target of this estimator and study the properties of confidence intervals (CIs) constructed from asymptotic normality and the bootstrap approach. In particular, we analyze the coverage deficiency due to finite number of random initializations. We also investigate the CIs by inverting the likelihood ratio test, the score test, and the Wald test, and we show that the resulting CIs may be very different. We propose a two-sample test procedure even when the MLE is intractable. In addition, we analyze the performance of the EM algorithm under random initializations and derive the coverage of a CI with a finite number of initializations.
    Kernel Two-Sample Tests in High Dimension: Interplay Between Moment Discrepancy and Dimension-and-Sample Orders. (arXiv:2201.00073v1 [math.ST])
    (2 min) Motivated by the increasing use of kernel-based metrics for high-dimensional and large-scale data, we study the asymptotic behavior of kernel two-sample tests when the dimension and sample sizes both diverge to infinity. We focus on the maximum mean discrepancy (MMD) with the kernel of the form $k(x,y)=f(\|x-y\|_{2}^{2}/\gamma)$, including MMD with the Gaussian kernel and the Laplacian kernel, and the energy distance as special cases. We derive asymptotic expansions of the kernel two-sample statistics, based on which we establish the central limit theorem (CLT) under both the null hypothesis and the local and fixed alternatives. The new non-null CLT results allow us to perform asymptotic exact power analysis, which reveals a delicate interplay between the moment discrepancy that can be detected by the kernel two-sample tests and the dimension-and-sample orders. The asymptotic theory is further corroborated through numerical studies. Our findings complement those in the recent literature and shed new light on the use of kernel two-sample tests for high-dimensional and large-scale data.
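    The statistic under study is easy to compute; here is the standard unbiased MMD^2 estimator for the Gaussian-kernel special case $k(x,y)=\exp(-\|x-y\|_{2}^{2}/\gamma)$.

```python
import numpy as np

def mmd_unbiased(X, Y, gamma):
    """Unbiased estimator of MMD^2 with the Gaussian kernel
    k(x, y) = exp(-||x - y||_2^2 / gamma)."""
    def gram(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / gamma)
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    n, m = len(X), len(Y)
    # Drop diagonal terms for the unbiased within-sample averages.
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2.0 * Kxy.mean())

rng = np.random.default_rng(1)
X, Y = rng.normal(0, 1, (100, 5)), rng.normal(0.5, 1, (100, 5))
print(mmd_unbiased(X, Y, gamma=5.0))   # clearly positive under a mean shift
```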
    On Robust Probabilistic Principal Component Analysis using Multivariate $t$-Distributions. (arXiv:2010.10786v2 [stat.ME] UPDATED)
    (2 min) Probabilistic principal component analysis (PPCA) is a probabilistic reformulation of principal component analysis (PCA), under the framework of a Gaussian latent variable model. To improve the robustness of PPCA, it has been proposed to change the underlying Gaussian distributions to multivariate $t$-distributions. Based on the representation of $t$-distribution as a scale mixture of Gaussian distributions, a hierarchical model is used for implementation. However, in the existing literature, the hierarchical model implemented does not yield the equivalent interpretation. In this paper, we present two sets of equivalent relationships between the high-level multivariate $t$-PPCA framework and the hierarchical model used for implementation. In doing so, we clarify a current misrepresentation in the literature, by specifying the correct correspondence. In addition, we discuss the performance of different multivariate $t$ robust PPCA methods both in theory and simulation studies, and propose a novel Monte Carlo expectation-maximization (MCEM) algorithm to implement one general type of such models.
    Global convergence of optimized adaptive importance samplers. (arXiv:2201.00409v1 [stat.CO])
    (2 min) We analyze the optimized adaptive importance sampler (OAIS) for performing Monte Carlo integration with general proposals. We leverage a classical result which shows that the bias and the mean-squared error (MSE) of importance sampling scale with the $\chi^2$-divergence between the target and the proposal, and develop a scheme which performs global optimization of the $\chi^2$-divergence. While it is known that this quantity is convex for exponential family proposals, the case of general proposals has been an open problem. We close this gap by utilizing stochastic gradient Langevin dynamics (SGLD) and its underdamped counterpart for the global optimization of the $\chi^2$-divergence, and derive nonasymptotic bounds for the MSE by leveraging recent results from the non-convex optimization literature. The resulting AIS schemes have explicit theoretical guarantees uniform in the number of iterations.
    Cluster Stability Selection. (arXiv:2201.00494v1 [stat.ME])
    (2 min) Stability selection (Meinshausen and Buhlmann, 2010) makes any feature selection method more stable by returning only those features that are consistently selected across many subsamples. We prove (in what is, to our knowledge, the first result of its kind) that for data containing highly correlated proxies for an important latent variable, the lasso typically selects one proxy, yet stability selection with the lasso can fail to select any proxy, leading to worse predictive performance than the lasso alone. We introduce cluster stability selection, which exploits the practitioner's knowledge that highly correlated clusters exist in the data, resulting in better feature rankings than stability selection in this setting. We consider several feature-combination approaches, including taking a weighted average of the features in each important cluster where weights are determined by the frequency with which cluster members are selected, which we show leads to better predictive models than previous proposals. We present generalizations of theoretical guarantees from Meinshausen and Buhlmann (2010) and Shah and Samworth (2012) to show that cluster stability selection retains the same guarantees. In summary, cluster stability selection enjoys the best of both worlds, yielding a sparse selected set that is both stable and has good predictive performance.
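    A hedged sketch of the core loop, assuming the feature clusters are known to the practitioner (cluster membership, the lasso penalty `alpha`, and the subsample count below are all illustrative):

    ```python
    import numpy as np
    from sklearn.linear_model import Lasso

    def cluster_selection_frequencies(X, y, clusters, alpha=0.1,
                                      n_subsamples=100, seed=0):
        """On each half-sample, fit the lasso and record whether *any* member of
        each correlated-feature cluster was selected, plus per-member selection
        counts (usable as the weighted-average combination weights the abstract
        mentions). `clusters` is a list of index arrays, assumed known a priori."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        cluster_hits = np.zeros(len(clusters))
        member_hits = [np.zeros(len(c)) for c in clusters]
        for _ in range(n_subsamples):
            idx = rng.choice(n, size=n // 2, replace=False)
            selected = Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_ != 0
            for k, c in enumerate(clusters):
                cluster_hits[k] += selected[c].any()
                member_hits[k] += selected[c]
        weights = [h / max(h.sum(), 1.0) for h in member_hits]
        return cluster_hits / n_subsamples, weights
    ```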
    Risk Bounds for Over-parameterized Maximum Margin Classification on Sub-Gaussian Mixtures. (arXiv:2104.13628v2 [cs.LG] UPDATED)
    (2 min) Modern machine learning systems such as deep neural networks are often highly over-parameterized so that they can fit the noisy training data exactly, yet they can still achieve small test errors in practice. In this paper, we study this "benign overfitting" phenomenon of the maximum margin classifier for linear classification problems. Specifically, we consider data generated from sub-Gaussian mixtures, and provide a tight risk bound for the maximum margin linear classifier in the over-parameterized setting. Our results precisely characterize the condition under which benign overfitting can occur in linear classification problems, and improve on previous work. They also have direct implications for over-parameterized logistic regression.
    An Ensemble of Pre-trained Transformer Models For Imbalanced Multiclass Malware Classification. (arXiv:2112.13236v2 [cs.CR] UPDATED)
    (2 min) Classification of malware families is crucial for a comprehensive understanding of how they can infect devices, computers, or systems. Thus, malware identification enables security researchers and incident responders to take precautions against malware and accelerate mitigation. API call sequences made by malware are widely utilized features in machine and deep learning models for malware classification, as these sequences represent the behavior of malware. However, traditional machine and deep learning models remain incapable of capturing sequence relationships among API calls. Transformer-based models, by contrast, process sequences as a whole and learn relationships between API calls through multi-head attention mechanisms and positional embeddings. Our experiments demonstrate that a transformer model with a single transformer block layer surpasses the widely used LSTM baseline. Moreover, the pre-trained transformer models BERT and CANINE performed best at classifying highly imbalanced malware families, as measured by F1-score and AUC. Furthermore, the proposed bagging-based random transformer forest (RTF), an ensemble of BERT or CANINE models, reaches state-of-the-art evaluation scores on three out of four datasets, including a state-of-the-art F1-score of 0.6149 on one of the commonly used benchmark datasets.
    Covariate-assisted Sparse Tensor Completion. (arXiv:2103.06428v2 [stat.ML] UPDATED)
    (2 min) We aim to provably complete a sparse and highly missing tensor in the presence of covariate information along tensor modes. Our motivation comes from online advertising, where users' click-through rates (CTR) on ads over various devices form a CTR tensor that has about 96% missing entries and many zeros among its non-missing entries, which makes standalone tensor completion unsatisfactory. Besides the CTR tensor, additional ad features or user characteristics are often available. In this paper, we propose Covariate-assisted Sparse Tensor Completion (COSTCO) to incorporate covariate information for the recovery of the sparse tensor. The key idea is to jointly extract latent components from both the tensor and the covariate matrix to learn a synthetic representation. Theoretically, we derive the error bound for the recovered tensor components and explicitly quantify the improvements in both the reveal probability condition and the tensor recovery accuracy due to covariates. Finally, we apply COSTCO to an advertisement dataset consisting of a CTR tensor and an ad covariate matrix, obtaining a 23% accuracy improvement over the baseline. An important by-product is that the ad latent components from COSTCO reveal interesting ad clusters, which are useful for better ad targeting.
    DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning. (arXiv:2106.03760v3 [cs.LG] UPDATED)
    (2 min) The Mixture-of-Experts (MoE) architecture is showing promising results in improving parameter sharing in multi-task learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable sparse gate to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: a continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation. The gate can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select. We demonstrate the effectiveness of DSelect-k on both synthetic and real MTL datasets with up to $128$ tasks. Our experiments indicate that DSelect-k can achieve statistically significant improvements in prediction and expert selection over popular MoE gates. Notably, on a real-world, large-scale recommender system, DSelect-k achieves over $22\%$ improvement in predictive performance compared to Top-k. We provide an open-source implementation of DSelect-k.
    Class-Incremental Continual Learning into the eXtended DER-verse. (arXiv:2201.00766v1 [cs.LG])
    (2 min) The staple of human intelligence is the capability of acquiring knowledge in a continuous fashion. In stark contrast, Deep Networks forget catastrophically and, for this reason, the sub-field of Class-Incremental Continual Learning fosters methods that learn a sequence of tasks incrementally, blending sequentially-gained knowledge into a comprehensive prediction. This work aims at assessing and overcoming the pitfalls of our previous proposal Dark Experience Replay (DER), a simple and effective approach that combines rehearsal and Knowledge Distillation. Inspired by the way our minds constantly rewrite past recollections and set expectations for the future, we endow our model with the abilities to i) revise its replay memory to welcome novel information regarding past data and ii) pave the way for learning yet-unseen classes. We show that the application of these strategies leads to remarkable improvements; indeed, the resulting method - termed eXtended-DER (X-DER) - outperforms the state of the art on both standard benchmarks (such as CIFAR-100 and miniImagenet) and a novel one introduced here. To gain a better understanding, we further provide extensive ablation studies that corroborate and extend the findings of our previous research (e.g., the value of Knowledge Distillation and flatter minima in continual learning setups).
    Neural Network Middle-Term Probabilistic Forecasting of Daily Power Consumption. (arXiv:2006.16388v2 [stat.ME] UPDATED)
    (2 min) Middle-term-horizon (months to a year) power consumption prediction is a main challenge in the energy sector, in particular when probabilistic forecasting is considered. We propose a new modelling approach that incorporates trend, seasonality, and weather conditions as explicative variables in a shallow neural network with an autoregressive feature. Applying it to daily power consumption in New England, U.S.A., we obtain excellent density-forecast results on a one-year test set. The quality of the achieved probabilistic forecasts has been verified, on the one hand, by comparing the results to other standard models for density forecasting and, on the other hand, by considering measures frequently used in the energy sector, such as the pinball loss and CI backtesting.
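    For reference, the pinball loss mentioned at the end is a one-liner; a minimal numpy version (the sample values are made up):

    ```python
    import numpy as np

    def pinball_loss(y_true, y_pred, q):
        """Pinball (quantile) loss at level q in (0, 1), the standard score for
        probabilistic forecasts: under-predictions are weighted by q and
        over-predictions by (1 - q)."""
        diff = y_true - y_pred
        return np.mean(np.maximum(q * diff, (q - 1) * diff))

    # Score a forecast's 90th-percentile band against realized consumption:
    y = np.array([100.0, 120.0, 110.0])
    q90 = np.array([115.0, 118.0, 125.0])
    print(pinball_loss(y, q90, q=0.9))
    ```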
    Identify comorbidities associated with recurrent ED and in-patient visits. (arXiv:2110.13769v2 [stat.ML] UPDATED)
    (2 min) In the hospital setting, a small percentage of recurrent frequent patients contributes a disproportionate amount of healthcare resource usage. Moreover, in many of these cases, patient outcomes can be greatly improved by reducing recurring visits, especially when they are associated with substance abuse, mental health, and medical factors that could be improved by social-behavioral interventions, outpatient care, or preventative care. Additionally, health care costs can be reduced significantly with fewer preventable recurrent visits. To address this, we developed a computationally efficient and interpretable framework that both identifies recurrent patients with high utilization and determines which comorbidities contribute most to their recurrent visits. Specifically, we present a novel algorithm, called minimum similarity association rules (MSAR), that balances the confidence-support trade-off to determine the conditions most associated with recurring emergency department (ED) and inpatient visits. We validate MSAR on a large Electronic Health Record (EHR) dataset.
    Linear Classifiers in Product Space Forms. (arXiv:2102.10204v3 [cs.LG] UPDATED)
    (2 min) Embedding methods for product spaces are powerful techniques for low-distortion and low-dimensional representation of complex data structures. Here, we address the new problem of linear classification in product space forms -- products of Euclidean, spherical, and hyperbolic spaces. First, we describe novel formulations for linear classifiers on a Riemannian manifold using geodesics and Riemannian metrics which generalize straight lines and inner products in vector spaces. Second, we prove that linear classifiers in $d$-dimensional space forms of any curvature have the same expressive power, i.e., they can shatter exactly $d+1$ points. Third, we formalize linear classifiers in product space forms, describe the first known perceptron and support vector machine classifiers for such spaces and establish rigorous convergence results for perceptrons. Moreover, we prove that the Vapnik-Chervonenkis dimension of linear classifiers in a product space form of dimension $d$ is \emph{at least} $d+1$. We support our theoretical findings with simulations on several datasets, including synthetic data, image data, and single-cell RNA sequencing (scRNA-seq) data. The results show that classification in low-dimensional product space forms for scRNA-seq data offers, on average, a performance improvement of $\sim15\%$ when compared to that in Euclidean spaces of the same dimension.
    Provably Efficient Reinforcement Learning with Linear Function Approximation Under Adaptivity Constraints. (arXiv:2101.02195v2 [cs.LG] UPDATED)
    (2 min) We study reinforcement learning (RL) with linear function approximation under the adaptivity constraint. We consider two popular limited adaptivity models: the batch learning model and the rare policy switch model, and propose two efficient online RL algorithms for episodic linear Markov decision processes, where the transition probability and the reward function can be represented as a linear function of some known feature mapping. Specifically, for the batch learning model, our proposed LSVI-UCB-Batch algorithm achieves an $\tilde O(\sqrt{d^3H^3T} + dHT/B)$ regret, where $d$ is the dimension of the feature mapping, $H$ is the episode length, $T$ is the number of interactions and $B$ is the number of batches. Our result suggests that it suffices to use only $\sqrt{T/dH}$ batches to obtain $\tilde O(\sqrt{d^3H^3T})$ regret. For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an $\tilde O(\sqrt{d^3H^3T[1+T/(dH)]^{dH/B}})$ regret, which implies that $dH\log T$ policy switches suffice to obtain the $\tilde O(\sqrt{d^3H^3T})$ regret. Our algorithms achieve the same regret as the LSVI-UCB algorithm (Jin et al., 2019), yet with a substantially smaller amount of adaptivity. We also establish a lower bound for the batch learning model, which suggests that the dependency on $B$ in our regret bound is tight.
    Boosting the Certified Robustness of L-infinity Distance Nets. (arXiv:2110.06850v3 [cs.LG] UPDATED)
    (2 min) Recently, Zhang et al. (2021) developed a new neural network architecture based on $\ell_\infty$-distance functions, which naturally possesses certified $\ell_\infty$ robustness by construction. Despite rigorous theoretical guarantees, the model has so far only achieved performance comparable to conventional networks. In this paper, we make the following two contributions: $\mathrm{(i)}$ we demonstrate that $\ell_\infty$-distance nets enjoy a fundamental advantage in certified robustness over conventional networks (under typical certification approaches); $\mathrm{(ii)}$ with an improved training process, we are able to significantly boost the certified accuracy of $\ell_\infty$-distance nets. Our training approach largely alleviates the optimization difficulties that arose in the previous training scheme, in particular the unexpectedly large Lipschitz constant caused by a crucial trick called $\ell_p$-relaxation. The core of our approach is a novel objective function that combines a scaled cross-entropy loss and a clipped hinge loss with a decaying mixing coefficient. Experiments show that using the proposed training strategy, the certified accuracy of the $\ell_\infty$-distance net can be dramatically improved from 33.30% to 40.06% on CIFAR-10 ($\epsilon=8/255$), meanwhile outperforming other approaches in this area by a large margin. Our results clearly demonstrate the effectiveness and potential of the $\ell_\infty$-distance net for certified robustness. Code is available at https://github.com/zbh2047/L_inf-dist-net-v2.
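    The abstract only names the ingredients of the objective (scaled cross-entropy, clipped hinge, decaying mixing coefficient), so the following is a speculative sketch of that shape rather than the paper's exact formula; the scale `s`, the margin `theta`, and the linear decay schedule are all assumptions:

    ```python
    import numpy as np

    def mixed_objective(logits, y, step, total_steps, s=8.0, theta=0.5):
        """Speculative sketch: scaled cross-entropy blended with a clipped hinge
        on the multi-class margin, with a mixing coefficient that decays linearly
        over training. s, theta, and the schedule are illustrative guesses."""
        n, c = logits.shape
        lam = 1.0 - step / total_steps               # decaying mixing coefficient
        z = s * logits                               # scaled logits for the CE term
        zmax = z.max(axis=1, keepdims=True)
        lse = zmax[:, 0] + np.log(np.exp(z - zmax).sum(axis=1))
        ce = lse - z[np.arange(n), y]
        # Margin of the true class over the best other class, hinge clipped at theta.
        true = logits[np.arange(n), y]
        other = np.where(np.eye(c, dtype=bool)[y], -np.inf, logits).max(axis=1)
        hinge = np.clip(theta - (true - other), 0.0, theta)
        return (lam * ce + (1.0 - lam) * hinge).mean()
    ```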
    A General Framework for Treatment Effect Estimation in Semi-Supervised and High Dimensional Settings. (arXiv:2201.00468v1 [stat.ME])
    (2 min) In this article, we aim to provide a general and complete understanding of semi-supervised (SS) causal inference for treatment effects. Specifically, we consider two such estimands: (a) the average treatment effect and (b) the quantile treatment effect, as prototype cases, in an SS setting, characterized by two available data sets: (i) a labeled data set of size $n$, providing observations for a response and a set of high dimensional covariates, as well as a binary treatment indicator; and (ii) an unlabeled data set of size $N$, much larger than $n$, but without the response observed. Using these two data sets, we develop a family of SS estimators which are ensured to be: (1) more robust and (2) more efficient than their supervised counterparts based on the labeled data set only. Beyond the 'standard' double robustness results (in terms of consistency) that can be achieved by supervised methods as well, we further establish root-n consistency and asymptotic normality of our SS estimators whenever the propensity score in the model is correctly specified, without requiring specific forms of the nuisance functions involved. Such an improvement of robustness arises from the use of the massive unlabeled data, so it is generally not attainable in a purely supervised setting. In addition, our estimators are shown to be semi-parametrically efficient as long as all the nuisance functions are correctly specified. Moreover, as an illustration of the nuisance estimators, we consider inverse-probability-weighting type kernel smoothing estimators involving unknown covariate transformation mechanisms, and establish in high dimensional scenarios novel results on their uniform convergence rates, which should be of independent interest. Numerical results on both simulated and real data validate the advantage of our methods over their supervised counterparts with respect to both robustness and efficiency.
    Generalization Bounds for Noisy Iterative Algorithms Using Properties of Additive Noise Channels. (arXiv:2102.02976v3 [stat.ML] UPDATED)
    (2 min) Machine learning models trained by different optimization algorithms under different data distributions can exhibit distinct generalization behaviors. In this paper, we analyze the generalization of models trained by noisy iterative algorithms. We derive distribution-dependent generalization bounds by connecting noisy iterative algorithms to additive noise channels found in communication and information theory. Our generalization bounds shed light on several applications, including differentially private stochastic gradient descent (DP-SGD), federated learning, and stochastic gradient Langevin dynamics (SGLD). We demonstrate our bounds through numerical experiments, showing that they can help understand recent empirical observations of the generalization phenomena of neural networks.
    Minimum Excess Risk in Bayesian Learning. (arXiv:2012.14868v2 [cs.LG] UPDATED)
    (2 min) We analyze the best achievable performance of Bayesian learning under generative models by defining and upper-bounding the minimum excess risk (MER): the gap between the minimum expected loss attainable by learning from data and the minimum expected loss that could be achieved if the model realization were known. The definition of MER provides a principled way to define different notions of uncertainties in Bayesian learning, including the aleatoric uncertainty and the minimum epistemic uncertainty. Two methods for deriving upper bounds for the MER are presented. The first method, generally suitable for Bayesian learning with a parametric generative model, upper-bounds the MER by the conditional mutual information between the model parameters and the quantity being predicted given the observed data. It allows us to quantify the rate at which the MER decays to zero as more data becomes available. Under realizable models, this method also relates the MER to the richness of the generative function class, notably the VC dimension in binary classification. The second method, particularly suitable for Bayesian learning with a parametric predictive model, relates the MER to the minimum estimation error of the model parameters from data. It explicitly shows how the uncertainty in model parameter estimation translates to the MER and to the final prediction uncertainty. We also extend the definition and analysis of MER to the setting with multiple model families and the setting with nonparametric models. Along the way, we draw some comparisons between the MER in Bayesian learning and the excess risk in frequentist learning.
    Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs. (arXiv:2010.00587v3 [cs.LG] UPDATED)
    (2 min) We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-$\gamma$, which is based on the \emph{optimism in the face of uncertainty principle} and the Bernstein-type bonus. We show that UCBVI-$\gamma$ achieves an $\tilde{O}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tilde{\Omega}\big({\sqrt{SAT}}/{(1-\gamma)^{1.5}}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$\gamma$ is nearly minimax optimal for discounted MDPs.
    The Flip Side of the Reweighted Coin: Duality of Adaptive Dropout and Regularization. (arXiv:2106.07769v3 [cs.LG] UPDATED)
    (2 min) Among the most successful methods for sparsifying deep (neural) networks are those that adaptively mask the network weights throughout training. By examining this masking, or dropout, in the linear case, we uncover a duality between such adaptive methods and regularization through the so-called "$\eta$-trick" that casts both as iteratively reweighted optimizations. We show that any dropout strategy that adapts to the weights in a monotonic way corresponds to an effective subquadratic regularization penalty, and therefore leads to sparse solutions. We obtain the effective penalties for several popular sparsification strategies, which are remarkably similar to classical penalties commonly used in sparse optimization. Considering variational dropout as a case study, we demonstrate similar empirical behavior between the adaptive dropout method and classical methods on the task of deep network sparsification, validating our theory.
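    To make the "$\eta$-trick" concrete: since $|w| = \min_{\eta > 0} \left( w^2/(2\eta) + \eta/2 \right)$ with the minimum at $\eta = |w|$, an $\ell_1$ penalty can be handled by alternating ridge-like solves for $w$ with closed-form $\eta$ updates. A minimal sketch of that reweighting loop (not the paper's code; data and penalty below are toy values):

    ```python
    import numpy as np

    def irls_lasso(X, y, lam=0.1, n_iters=50, eps=1e-8):
        """Lasso via the eta-trick: alternate a weighted ridge solve for w with
        the closed-form update eta_j = |w_j|. The eps guard keeps the reweighting
        matrix invertible as coefficients shrink toward zero."""
        d = X.shape[1]
        eta = np.ones(d)
        XtX, Xty = X.T @ X, X.T @ y
        for _ in range(n_iters):
            w = np.linalg.solve(XtX + lam * np.diag(1.0 / (eta + eps)), Xty)
            eta = np.abs(w)
        return w

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 10))
    w_true = np.array([3.0, -2.0] + [0.0] * 8)
    y = X @ w_true + 0.1 * rng.standard_normal(100)
    print(np.round(irls_lasso(X, y, lam=5.0), 2))   # near-sparse estimate
    ```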
    Hands-on Bayesian Neural Networks -- a Tutorial for Deep Learning Users. (arXiv:2007.06823v3 [cs.LG] UPDATED)
    (2 min) Modern deep learning methods constitute incredibly powerful tools to tackle a myriad of challenging problems. However, since deep learning methods operate as black boxes, the uncertainty associated with their predictions is often challenging to quantify. Bayesian statistics offer a formalism to understand and quantify the uncertainty associated with deep neural network predictions. This tutorial provides an overview of the relevant literature and a complete toolset to design, implement, train, use and evaluate Bayesian Neural Networks, i.e. Stochastic Artificial Neural Networks trained using Bayesian methods.
    Recover the spectrum of covariance matrix: a non-asymptotic iterative method. (arXiv:2201.00230v1 [stat.ML])
    (2 min) It is well known that the sample covariance matrix has a consistent bias in its spectrum; for example, the spectrum of a Wishart matrix follows the Marchenko-Pastur law. In this work we introduce an iterative algorithm, 'Concent', that actively eliminates this bias and recovers the true spectrum for small and moderate dimensions.
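    The bias in question is easy to see numerically; this is only a demonstration of the phenomenon the paper targets, not the 'Concent' algorithm itself:

    ```python
    import numpy as np

    # True covariance is I_p, yet the sample covariance spectrum spreads over
    # roughly [(1 - sqrt(p/n))^2, (1 + sqrt(p/n))^2] per the Marchenko-Pastur law.
    rng = np.random.default_rng(0)
    n, p = 1000, 300
    X = rng.standard_normal((n, p))
    eigs = np.linalg.eigvalsh(X.T @ X / n)
    gamma = p / n
    print(eigs.min(), eigs.max())                            # spread, not a spike at 1
    print((1 - gamma ** 0.5) ** 2, (1 + gamma ** 0.5) ** 2)  # MP edges ~0.20, ~2.40
    ```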
    Statistical and Topological Properties of Sliced Probability Divergences. (arXiv:2003.05783v2 [stat.ML] UPDATED)
    (2 min) The idea of slicing divergences has proven successful when comparing two probability measures in various machine learning applications, including generative modeling; it consists in computing the expected value of a 'base divergence' between one-dimensional random projections of the two measures. However, the topological, statistical, and computational consequences of this technique have not yet been well-established. In this paper, we aim at bridging this gap and derive various theoretical properties of sliced probability divergences. First, we show that slicing preserves the metric axioms and the weak continuity of the divergence, implying that the sliced divergence shares similar topological properties. We then make these results more precise in the case where the base divergence belongs to the class of integral probability metrics. On the other hand, we establish that, under mild conditions, the sample complexity of a sliced divergence does not depend on the problem dimension. We finally apply our general results to several base divergences, and illustrate our theory on both synthetic and real data experiments.
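    For the most common instance, the sliced 1-Wasserstein distance, the slicing recipe fits in a few lines (equal sample sizes assumed for simplicity; the projection count is arbitrary):

    ```python
    import numpy as np

    def sliced_wasserstein_1(X, Y, n_projections=200, seed=0):
        """Monte Carlo sliced 1-Wasserstein distance between two empirical
        measures with equal sample sizes: average, over random unit directions,
        of the 1D Wasserstein distance between projected samples (which reduces
        to a mean absolute difference of sorted projections)."""
        rng = np.random.default_rng(seed)
        thetas = rng.standard_normal((n_projections, X.shape[1]))
        thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
        dists = [np.abs(np.sort(X @ t) - np.sort(Y @ t)).mean() for t in thetas]
        return float(np.mean(dists))

    X = np.random.default_rng(1).normal(0.0, 1.0, size=(400, 3))
    Y = np.random.default_rng(2).normal(1.0, 1.0, size=(400, 3))
    print(sliced_wasserstein_1(X, Y))   # grows with the separation between X and Y
    ```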
    Fair Data Representation for Machine Learning at the Pareto Frontier. (arXiv:2201.00292v1 [stat.ML])
    (2 min) As machine-learning-powered decision making plays an increasingly important role in our daily lives, it is imperative to strive for fairness in the underlying data processing and algorithms. We propose a pre-processing algorithm for fair data representation via which L2-objective supervised learning algorithms result in an estimation of the Pareto frontier between prediction error and statistical disparity. In particular, the present work applies optimal positive definite affine transport maps to approach the post-processing Wasserstein barycenter characterization of optimal fair L2-objective supervised learning via a pre-processing data deformation. We call the resulting data the Wasserstein pseudo-barycenter. Furthermore, we show that the Wasserstein geodesics from the learning outcome marginals to the barycenter characterize the Pareto frontier between L2-loss and the total Wasserstein distance among learning outcome marginals. Thereby, an application of McCann interpolation generalizes the pseudo-barycenter to a family of data representations via which L2-objective supervised learning algorithms trace out the Pareto frontier. Numerical simulations underscore the advantages of the proposed data representation: (1) the pre-processing step composes with arbitrary L2-objective supervised learning methods and unseen data; (2) the fair representation protects data privacy by preventing the training machine from direct or indirect access to the sensitive information in the data; (3) the optimal affine map yields efficient computation of fair supervised learning on high-dimensional data; (4) experimental results shed light on the fairness of L2-objective unsupervised learning via the proposed fair data representation.
    Latent structure blockmodels for Bayesian spectral graph clustering. (arXiv:2107.01734v2 [stat.ML] UPDATED)
    (2 min) Spectral embedding of network adjacency matrices often produces node representations living approximately around low-dimensional submanifold structures. In particular, hidden substructure is expected to arise when the graph is generated from a latent position model. Furthermore, the presence of communities within the network might generate community-specific submanifold structures in the embedding, but this is not explicitly accounted for in most statistical models for networks. In this article, a class of models called latent structure block models (LSBM) is proposed to address such scenarios, allowing for graph clustering when community-specific one-dimensional manifold structure is present. LSBMs focus on a specific class of latent space model, the random dot product graph (RDPG), and assign a latent submanifold to the latent positions of each community. A Bayesian model for the embeddings arising from LSBMs is discussed, and shown to have good performance on simulated and real-world network data. The model is able to correctly recover the underlying communities living in a one-dimensional manifold, even when the parametric form of the underlying curves is unknown, achieving remarkable results on a variety of real data.
    Parallel sequential Monte Carlo for stochastic gradient-free nonconvex optimization. (arXiv:1811.09469v4 [stat.CO] UPDATED)
    (2 min) We introduce and analyze a parallel sequential Monte Carlo methodology for the numerical solution of optimization problems that involve the minimization of a cost function that consists of the sum of many individual components. The proposed scheme is a stochastic zeroth order optimization algorithm which demands only the capability to evaluate small subsets of components of the cost function. It can be depicted as a bank of samplers that generate particle approximations of several sequences of probability measures. These measures are constructed in such a way that they have associated probability density functions whose global maxima coincide with the global minima of the original cost function. The algorithm selects the best performing sampler and uses it to approximate a global minimum of the cost function. We prove analytically that the resulting estimator converges to a global minimum of the cost function almost surely and provide explicit convergence rates in terms of the number of generated Monte Carlo samples and the dimension of the search space. We show, by way of numerical examples, that the algorithm can tackle cost functions with multiple minima or with broad "flat" regions which are hard to minimize using gradient-based techniques.
    Neural network training under semidefinite constraints. (arXiv:2201.00632v1 [cs.LG])
    (2 min) This paper is concerned with the training of neural networks (NNs) under semidefinite constraints. This type of training problem has recently gained popularity since semidefinite constraints can be used to verify interesting properties of NNs, including, e.g., an upper bound on the Lipschitz constant, which relates to the robustness of an NN, or the stability of dynamic systems with NN controllers. The utilized semidefinite constraints are based on sector constraints satisfied by the underlying activation functions. Unfortunately, one of the biggest bottlenecks of these new results is the computational effort required to incorporate the semidefinite constraints into the training of NNs, which limits their scalability to large NNs. We address this challenge by developing interior-point methods for NN training that we implement using barrier functions for semidefinite constraints. To efficiently compute the gradients of the barrier terms, we exploit the structure of the semidefinite constraints. In experiments, we demonstrate the superior efficiency of our training method over previous approaches, which allows us, e.g., to use semidefinite constraints in the training of Wasserstein generative adversarial networks, where the discriminator must satisfy a Lipschitz condition.
    Deep Feature Fusion for Mitosis Counting. (arXiv:2002.03781v2 [cs.CV] UPDATED)
    (2 min) Each woman living in the United States has about a 1 in 8 chance of developing invasive breast cancer. The mitotic cell count is one of the most common tests to assess the aggressiveness or grade of breast cancer. For this prognosis, histopathology images must be examined by a pathologist using high-resolution microscopes to count the cells. Unfortunately, this can be an exhausting task with poor reproducibility, especially for non-experts. Deep learning networks that can automatically localize such regions of interest have recently been adapted to medical applications. However, these region-based networks lack the ability to take advantage of the segmentation features produced by a full-image CNN, which are often used as a sole method of detection. The proposed method therefore leverages Faster R-CNN for object detection while fusing segmentation features generated by a UNet with RGB image features, achieving an F-score of 0.508 on the MITOS-ATYPIA 2014 mitosis counting challenge dataset and outperforming state-of-the-art methods.
    Improving Actor-Critic Reinforcement Learning via Hamiltonian Monte Carlo Method. (arXiv:2103.12020v3 [cs.LG] UPDATED)
    (2 min) Actor-critic RL is widely used in various robotic control tasks. Viewing actor-critic RL from the perspective of variational inference (VI), the policy network is trained to obtain the approximate posterior of actions given the optimality criteria. However, in practice, actor-critic RL may yield suboptimal policy estimates due to the amortization gap and insufficient exploration. In this work, inspired by the previous use of Hamiltonian Monte Carlo (HMC) in VI, we propose to integrate the policy network of actor-critic RL with HMC, which we term the Hamiltonian Policy. Specifically, we propose to evolve actions from the base policy according to HMC, which has several benefits. First, HMC can improve the policy distribution to better approximate the posterior and hence reduce the amortization gap. Second, HMC can guide exploration toward regions of the action space with higher Q values, enhancing exploration efficiency. Further, instead of directly applying HMC to RL, we propose a new leapfrog operator to simulate the Hamiltonian dynamics. Finally, in safe RL problems, we find that the proposed method can not only improve the achieved return but also reduce safety-constraint violations by discarding potentially unsafe actions. With comprehensive empirical experiments on continuous control benchmarks, including MuJoCo and PyBullet Roboschool, we show that the proposed approach is a data-efficient and easy-to-implement improvement over previous actor-critic methods.
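    For readers unfamiliar with the machinery, the standard leapfrog integrator that the paper's new operator builds on looks like this (a generic sketch; `grad_log_p` stands in for the gradient of the soft Q-value/posterior over actions):

    ```python
    import numpy as np

    def leapfrog(grad_log_p, a0, m0, step_size, n_steps):
        """Standard leapfrog integration of Hamiltonian dynamics: half-step
        momentum update, alternating full steps, final half-step momentum
        update. Actions a are evolved along the gradient of the log target."""
        a, m = a0.copy(), m0.copy()
        m += 0.5 * step_size * grad_log_p(a)      # initial half-step on momentum
        for i in range(n_steps):
            a += step_size * m                    # full step on the action
            if i < n_steps - 1:
                m += step_size * grad_log_p(a)    # full step on momentum
        m += 0.5 * step_size * grad_log_p(a)      # final half-step on momentum
        return a, m

    # Toy target: standard Gaussian log-density, so grad log p(a) = -a.
    a, m = leapfrog(lambda a: -a, np.array([2.0]), np.array([0.0]), 0.1, 20)
    print(a, m)
    ```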
    Latent Gaussian Model Boosting. (arXiv:2105.08966v3 [cs.LG] UPDATED)
    (2 min) Latent Gaussian models and boosting are widely used techniques in statistics and machine learning. Tree-boosting shows excellent prediction accuracy on many data sets, but potential drawbacks are that it assumes conditional independence of samples, produces discontinuous predictions for, e.g., spatial data, and it can have difficulty with high-cardinality categorical variables. Latent Gaussian models, such as Gaussian process and grouped random effects models, are flexible prior models which explicitly model dependence among samples and which allow for efficient learning of predictor functions and for making probabilistic predictions. However, existing latent Gaussian models usually assume either a zero or a linear prior mean function which can be an unrealistic assumption. This article introduces a novel approach that combines boosting and latent Gaussian models to remedy the above-mentioned drawbacks and to leverage the advantages of both techniques. We obtain increased prediction accuracy compared to existing approaches in both simulated and real-world data experiments.
    Thinking inside the box: A tutorial on grey-box Bayesian optimization. (arXiv:2201.00272v1 [cs.LG])
    (2 min) Bayesian optimization (BO) is a framework for global optimization of expensive-to-evaluate objective functions. Classical BO methods assume that the objective function is a black box. However, internal information about objective function computation is often available. For example, when optimizing a manufacturing line's throughput with simulation, we observe the number of parts waiting at each workstation, in addition to the overall throughput. Recent BO methods leverage such internal information to dramatically improve performance. We call these "grey-box" BO methods because they treat objective computation as partially observable and even modifiable, blending the black-box approach with so-called "white-box" first-principles knowledge of objective function computation. This tutorial describes these methods, focusing on BO of composite objective functions, where one can observe and selectively evaluate individual constituents that feed into the overall objective; and multi-fidelity BO, where one can evaluate cheaper approximations of the objective function by varying parameters of the evaluation oracle.
    Deep-learning-based upscaling method for geologic models via theory-guided convolutional neural network. (arXiv:2201.00698v1 [cs.LG])
    (2 min) Large-scale or high-resolution geologic models usually comprise a huge number of grid blocks, which can be computationally demanding and time-consuming to solve with numerical simulators. Therefore, it is advantageous to upscale geologic models (e.g., hydraulic conductivity) from fine-scale (high-resolution) grids to coarse-scale systems. Numerical upscaling methods have proven effective and robust for coarsening geologic models, but their efficiency remains to be improved. In this work, a deep-learning-based method is proposed to upscale fine-scale geologic models, which can significantly improve upscaling efficiency. In the deep learning method, a deep convolutional neural network (CNN) is trained to approximate the relationship between the coarse grid of hydraulic conductivity fields and the hydraulic heads, and can then replace the numerical solvers when solving the flow equations for each coarse block. In addition, physical laws (e.g., governing equations and periodic boundary conditions) can be incorporated into the training process of the deep CNN model, which is termed the theory-guided convolutional neural network (TgCNN). With this physical information considered, the amount of data required to train the deep learning models can be greatly reduced. Several subsurface flow cases are introduced to test the performance of the proposed deep-learning-based upscaling method, including 2D and 3D cases, and isotropic and anisotropic cases. The results show that the deep learning method can provide upscaling accuracy equivalent to the numerical method, while efficiency is improved significantly compared to numerical upscaling.
    Operator Deep Q-Learning: Zero-Shot Reward Transferring in Reinforcement Learning. (arXiv:2201.00236v1 [cs.LG])
    (2 min) Reinforcement learning (RL) has drawn increasing interest in recent years due to its tremendous success in various applications. However, standard RL algorithms can only be applied to a single reward function and cannot adapt to an unseen reward function quickly. In this paper, we advocate a general operator view of reinforcement learning, which enables us to directly approximate the operator that maps a reward function to its value function. The benefit of learning this operator is that we can take any new reward function as input and attain its corresponding value function in a zero-shot manner. To approximate this special type of operator, we design a number of novel operator neural network architectures based on its theoretical properties. Our operator network designs outperform existing methods and the standard design of general-purpose operator networks, and we demonstrate the benefit of our operator deep Q-learning framework on several tasks, including reward transfer for offline policy evaluation (OPE) and reward transfer for offline policy optimization.
    Modeling Advection on Directed Graphs using Matérn Gaussian Processes for Traffic Flow. (arXiv:2201.00001v1 [math.NA])
    (2 min) The transport of traffic flow can be modeled by the advection equation. Finite difference and finite volume methods have been used to numerically solve this hyperbolic equation on a mesh. Advection has also been modeled discretely on directed graphs using the graph advection operator [4, 18]. In this paper, we first show that this graph advection operator can be reformulated as a finite difference scheme. We then propose the Directed Graph Advection Matérn Gaussian Process (DGAMGP) model, which incorporates the dynamics of the graph advection operator into the kernel of a trainable Matérn Gaussian process to effectively model traffic flow and its uncertainty as an advective process on a directed graph.
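    One common way to write a graph advection operator of the kind the paper builds on (this particular sign and weight convention is an assumption on my part, not necessarily the one used in [4, 18]):

    ```python
    import numpy as np

    def graph_advection_step(u, A, dt):
        """Forward-Euler step of du/dt = -(D_out - A.T) @ u on a directed graph,
        where A[i, j] is the weight of edge i -> j. Mass leaves node i at rate
        d_out[i] * u[i] and arrives at its out-neighbors; the operator's columns
        sum to zero, so total mass is conserved."""
        L = np.diag(A.sum(axis=1)) - A.T
        return u - dt * (L @ u)

    # Directed 3-cycle 0 -> 1 -> 2 -> 0: mass circulates, the total stays 1.
    A = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])
    u = np.array([1.0, 0.0, 0.0])
    for _ in range(5):
        u = graph_advection_step(u, A, dt=0.1)
    print(u, u.sum())
    ```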
    Optimal Representations for Covariate Shift. (arXiv:2201.00057v1 [cs.LG])
    (2 min) Machine learning systems often experience a distribution shift between training and testing. In this paper, we introduce a simple variational objective whose optima are exactly the set of all representations on which risk minimizers are guaranteed to be robust to any distribution shift that preserves the Bayes predictor, e.g., covariate shifts. Our objective has two components. First, a representation must remain discriminative for the task, i.e., some predictor must be able to simultaneously minimize the source and target risk. Second, the representation's marginal support needs to be the same across source and target. We make this practical by designing self-supervised learning methods that only use unlabelled data and augmentations to train robust representations. Our objectives achieve state-of-the-art results on DomainBed, and give insights into the robustness of recent methods, such as CLIP.
  • Featured Blog Posts - Data Science Central

    Beginning Transition to WordPress
    (2 min) Data Science Central is transitioning over to a new platform on January 10, 2022.  In order to complete this migration, we will not be accepting any new submissions for publication after today, January 4th, 2022 at 4:00PM EST. Regular writers will be notified by email about logging into the…
    Intelligent Gateways Applications For Greenfield and Brownfield Environments
    (3 min) The Internet of Things has led to the rise of a new era of computing where intelligent applications are continuously monitored. Also, connected devices need to be efficiently controlled, thereby continuously transforming information into knowledge. By intelligently gathering and analyzing huge amounts of data, smart systems can…
    Can OTT platforms succeed with machine learning services? An insight
    (5 min) Whether you're acquainted with devices such as the Amazon Fire Stick, Chromecast, Spotify, Youtube, or SlingTV, you presumably already have a general understanding of what over-the-top (OTT) television is. Numerous…
    Trends Towards 2022
    (25 min) It's the last week of the year. The gifts have been opened (well, the ones that aren't currently still sitting in a dock in Los Angeles after being ordered in November), the cats have been teaching tree ornaments the meaning of the word gravity, and the cookies which tasted so good on Christmas Eve are getting more than a bit stale. In short, it's…
  • The Official NVIDIA Blog

    Teamwork Makes AVs Work: NVIDIA and Deloitte Deliver Turnkey Solutions for AV Developers
    (4 min) Autonomous vehicles are born in the data center, which is why NVIDIA and Deloitte are delivering a strong foundation for developers to deploy robust self-driving technology. At CES this week, the companies detailed their collaboration, which is aimed at easing the biggest pain points in AV development. Deloitte, a leading global consulting firm, is pairing…
    ‘AI Dungeon’ Creator Nick Walton Uses AI to Generate Infinite Gaming Storylines
    (3 min) What started as Nick Walton’s college hackathon project grew into AI Dungeon, a popular text adventure game with over 1.5 million users. Walton is the co-founder and CEO of Latitude, a Utah-based startup that uses AI to create unique gaming storylines. He spoke with NVIDIA AI Podcast host Noah Kravitz about how natural language processing…
  • AWS Machine Learning Blog

    How Wix empowers customer care with AI capabilities using Amazon Transcribe
    (7 min) With over 200 million users worldwide, Wix is a leading cloud-based development platform for building fully personalized, high-quality websites. Wix makes it easy for anyone to create a beautiful and professional web presence. When Wix started, it was easy to understand user sentiment and identify product improvement opportunities because the user base was small. Such […]
    How to approach conversation design with Amazon Lex: Building and testing (Part 3)
    (11 min) In parts one and two of our guide to conversation design with Amazon Lex, we discussed how to gather requirements for your conversational AI application and draft conversational flows. In this post, we help you bring all the pieces together. You’ll learn how to draft an interaction model to deliver natural conversational experiences, and how to […]
    Deploying ML models using SageMaker Serverless Inference (Preview)
    (7 min) Amazon SageMaker Serverless Inference (Preview) was recently announced at re:Invent 2021 as a new model hosting feature that lets customers serve model predictions without having to explicitly provision compute instances or configure scaling policies to handle traffic variations. Serverless Inference is a new deployment capability that complements SageMaker’s existing options for deployment that include: SageMaker […]
    Build and visualize a real-time fraud prevention system using Amazon Fraud Detector
    (12 min) We’re living in a world of everything-as-an-online-service. Service providers from almost every industry are in the race to feature the best user experience for their online channels like web portals and mobile applications. This raises a new challenge. How do we stop illegal and fraudulent behaviors without impacting typical legitimate interactions? This challenge is even […]
  • John D. Cook

    One diagram, two completely different meanings
    (1 min) I was thumbing through a new book on causal inference, The Effect by Nick Huntington-Klein, and the following diagram caught my eye. Then it made my head hurt. It looks like a category theory diagram. What’s that doing in a book on causal inference? And if it is a category theory diagram, something’s wrong. Either […]
    Memorable techniques
    (2 min) My favorite part of this post by Ian Miell is the introduction. The article is about shell commands, but the introduction brings up a more general point. … there are thousands of reusable patterns I’ve picked up … Unfortunately, I’ve forgotten about 95% of them. … The point is to reflect on what actually stuck, so […]
  • Artificial Intelligence

    Log Object
    (1 min) Does anybody know what a log object is? submitted by /u/Theverybest196
    Modzy launched on Product Hunt!
    (0 min) The Modzy ModelOps platform accelerates the deployment, integration, and governance of production-ready AI. Supported by a growing community of data scientists and developers, Modzy solves the toughest problem with using AI at scale. Check out our post linked here, where you can sign up to test out our free version! submitted by /u/modzykirsten
    [R] Yale & IBM Propose KerGNNs: Interpretable GNNs with Graph Kernels That Achieve SOTA-Competitive Performance
    (1 min) A research team from Yale and IBM presents Kernel Graph Neural Networks (KerGNNs), which integrate graph kernels into the message passing process of GNNs in one framework, achieving performance comparable to state-of-the-art methods and significantly improving model interpretability compared with conventional GNNs. Here is a quick read: Yale & IBM Propose KerGNNs: Interpretable GNNs with Graph Kernels That Achieve SOTA-Competitive Performance. The paper KerGNNs: Interpretable Graph Neural Networks with Graph Kernels is on arXiv. submitted by /u/Yuqing7
    How do you measure fairness without access to demographic data?
    (2 min) Hi all! I'm working on a paper about measuring algorithmic fairness in cases where you don't have direct access to demographic data (for example, if you want to see whether a lender is discriminating against a particular race but the lender is not collecting/releasing race data of loan applicants). If you have ~10 minutes and work in the ethical AI space, it would be a great help to hear from this community on whether/how often you have faced this issue in practice and what you think should be done to mitigate it. Survey link is here: https://cambridge.eu.qualtrics.com/jfe/form/SV_e9czBBKDitlglaC submitted by /u/emmharv
    Use this new year to start learning something new! Whether it is machine learning or piano, just give it a try for 5 minutes tonight! If it is ML-related, this video might help out, and you can start there!
    (1 min) submitted by /u/OnlyProggingForFun
    Changing Gender and Race in Image Search Results With Machine Learning
    (0 min) submitted by /u/DaveBowman1975
    Phonesites: Launch Pages in Minutes from Your Phone
    (0 min) submitted by /u/belqassim
    Would it be possible to create an AI which can recognize your voice in the middle of a crowd or a somewhat noisy environment?
    (1 min) I know some software can already recognize specific voices, like Siri. However, I believe Siri in particular can’t recognize a voice beyond the “Hey Siri” keywords. Is it possible to create an AI that can continuously recognize a given person’s voice in a group of people or in a loud environment (loud restaurant, construction nearby, concert, etc.) in real time?
    Dumb-yet-sentient AI
    (3 min) It seems like most people associate sentient AI with Artificial General Intelligence (AGI), but why should they be related? Wikipedia defines sentience as the "ability to be aware of feelings and sensations", and this doesn't seem to require any specific level of intelligence, does it? Couldn't we build a sentient AI bot, for example, out of "dumb" neural networks (e.g. relatively small transformers and/or CNN/RNN models), so long as they connect together in a way that involves sensing a world around them (in whatever dimensions that world has, not necessarily the same as ours), using those sensations to construct some sort of internal "state", and identifying that state accurately enough to drive subsequent action or interaction with the world? submitted by /u/mm_maybe
  • Machine Learning

    [D] Legal use of functions in Pytorch or Tensorflow
    (1 min) Does anybody know whether it is legal to use Dropout or BatchNorm from PyTorch and TensorFlow, given Google's patents on these two techniques? Did any library avoid patent infringement in its implementation of those functions? submitted by /u/Nos_per
    [D] Search engine for time series
    (1 min) Hi, last year I developed a passenger flow forecasting model. Passenger flows are heavily influenced by lockdown measures and the weather, so I wanted to incorporate features relating to those into my model. Doing so, I encountered various frustrations. Data providers generally focus on a specific domain (e.g. covid, weather, ecommerce), while forecasts can be influenced by data from many domains. Finding these providers, signing up, and reading documentation is very time consuming. It takes a lot of code just to get the data you want. For example, to get covid data for my province, I first had to call a /list endpoint, then retrieve the province identifier, and then loop over the data endpoint because the maximum range was 2 months. I think the core problem is that APIs (as the …
    [D] (A paper suggests) Most Time Series Anomaly Detection Papers are Wrong
    (1 min) I just stumbled on this very nice paper [a], which will appear in AAAI-22. The title seems much too modest; they show that a random algorithm can achieve apparent SOTA results in this domain. This is a stunning result that casts doubt on the contribution of dozens of papers. For some reason, the area of Time Series Anomaly Detection seems to be the wild west of dubious papers and sloppy thinking. As an aside, there is a benchmark set of 250 datasets here [b] that can be evaluated in a way that is free of the flaw. (My post title reflects my understanding of the paper; the authors may have a different preferred claim.) [a] Towards a Rigorous Evaluation of Time-series Anomaly Detection https://arxiv.org/pdf/2109.05257.pdf [b] www.cs.ucr.edu/~eamonn/time_series_data_2018/UCR_TimeSeriesAnomalyDatasets2021.zip submitted by /u/eamonnkeogh
    [D] Multi-agent Deep Reinforcement learning
    (1 min) Hello! I hope you're doing well. I am working on a multi-agent system with MADDPG. At time t, when an agent asks for a task, the other agents are busy (i.e., the busy agents are those that are still processing a task and haven't finished it yet). With this configuration, in the learning phase, I don't know how to mask the state of the busy agents when injecting the state-action pair into the critic network. Thank you. submitted by /u/GuavaAgreeable208
    [D] VQ-VAE: Are there heuristics for the number of embeddings and the embedding dimension?
    (1 min) Hi r/MachineLearning, does anyone with experience training VQ-VAEs know if there are good rules of thumb for the embedding size? E.g. given data of dimension N, use M embeddings of size P. Thanks for any help! submitted by /u/Natural_Profession_8
    [D] Preparing for a Comprehensive Exam in ML
    (1 min) I am a PhD student based in Canada and have a comprehensive exam coming up in 4-6 months. This is an exam I have been nervous about since I began my PhD. I am fairly confident about the actual proposal and answering questions related to my field. What concerns me more are fundamental/background questions, as ML and statistics are so broad. Plus, I am a little on the older side and my memory is a little poor. Have any students here taken a comprehensive exam? If so, what was your experience and how did you prepare? Is reading/making notes from a textbook a good idea? Or is preparing a list of topics and reading extensively about them a better option? submitted by /u/ConfusedNoobie
    [D] Features of features of samples to features of samples?
    (1 min) I have an m x n matrix: m observations and n variables. Along with that, I have another dataset containing features of the variables, an n x p matrix. What could be a way to get an m x p matrix (other than naive matrix multiplication)? I wish to relate the observations to the features of the variables. submitted by /u/l34df4rm3r
    [D] What is the format of full paper presentations in general ML conferences like IJCAI and AAAI?
    (1 min) Hi all, this is my first conference season, so I am curious how authors of full papers (not extended abstracts or student-track papers, not invited speeches) present at conferences like ICML, AAAI, and IJCAI. For instance: are presentations given in a panel with 3-5 presenters using slides, or are they all presented as posters where authors stay available for a stretch of time for interested readers to show up and discuss? Or something else? submitted by /u/briannaszvenska
    [D] Can you recommend funny papers like "Single Headed Attention RNN: Stop Thinking With Your Head"
    (3 min) I really enjoyed reading this for a change from the usual textbook papers. Abstract: The leading approaches in language modeling are all obsessed with TV shows of my youth - namely Transformers and Sesame Street. Transformers this, Transformers that, and over here a bonfire worth of GPU-TPU-neuromorphic wafer scale silicon. We opt for the lazy path of old and proven techniques with a fancy crypto-inspired acronym: the Single Headed Attention RNN (SHA-RNN). The author's lone goal is to show that the entire field might have evolved a different direction if we had instead been obsessed with a slightly different acronym and slightly different result. We take a previously strong language model based only on boring LSTMs and get it to within a stone's throw of a stone's throw of state-of-the-art byte level language model results on enwik8. This work has undergone no intensive hyperparameter optimization and lived entirely on a commodity desktop machine that made the author's small studio apartment far too warm in the midst of a San Franciscan summer. The final results are achievable in plus or minus 24 hours on a single GPU as the author is impatient. The attention mechanism is also readily extended to large contexts with minimal computation. Take that Sesame Street. https://arxiv.org/abs/1911.11423 submitted by /u/Puzzled-Bite-8467
    [D] Normalizing flows for distributions with finite support
    (1 min) I need to learn a map from a Gaussian distribution to a Gamma distribution with some custom parameters. For both distributions, I can sample and evaluate the probability density. The first thing that came to mind is using a normalizing flow. Most approaches include the log target probability density in the loss function, but a normalizing flow sometimes returns negative values, for which this term is infinite. "Positivation" functions on top of the NF break the bijection property for some regions of space (if not theoretically, then definitely numerically). Is the NF approach inapplicable out of the box for such a simple problem, or am I missing something? submitted by /u/likan_blk
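    One standard fix that avoids a lossy "positivation" is to compose the flow with a genuine bijection onto the positive reals, e.g. exp, and account for it exactly in the change of variables. A minimal sketch with a plain Gaussian base (a trained flow would replace the base density):

    ```python
    import numpy as np
    from scipy import stats

    def log_prob_positive(x, base=stats.norm(0.0, 1.0)):
        """Density of X = exp(Z) via exact change of variables:
        log p_X(x) = log p_Z(log x) - log x, since |dz/dx| = 1/x."""
        return base.logpdf(np.log(x)) - np.log(x)

    x = np.array([0.5, 1.0, 2.0])
    print(log_prob_positive(x))
    print(stats.lognorm(s=1.0).logpdf(x))   # identical: this is the lognormal
    ```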
    [D] Methods to create monolingual language model from pretrained multilingual model
    (1 min) Apart from just fine-tuning the pretrained multilingual language model on the target language, is there anything more sophisticated that people are doing? submitted by /u/learning-machinist [link] [comments]
    [P] Blogs on fundamentals of Score-based and Diffusion Probabilistic Models
    (1 min) This two-part blog describes the theoretical fundamentals of score-based models and diffusion probabilistic models and the relationship between them. It is written as a coherent documentation of the theoretical developments in this new class of generative model. Rigorous mathematical proofs are excluded to keep it readable. Sharing it in case anyone finds it useful. Part 1: Score-based models. Part 2: Diffusion Probabilistic Models. submitted by /u/dasayan05 [link] [comments]
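    For readers who want the core idea in code, here is a minimal denoising score matching sketch at a single noise level on toy 1-D data (my own illustration, not code from the blog): for x_t = x_0 + sigma * eps, the regression target is the score of the perturbation kernel, -(x_t - x_0) / sigma^2.

        import torch
        import torch.nn as nn

        score_net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
        opt = torch.optim.Adam(score_net.parameters(), lr=1e-3)
        sigma = 0.5

        for step in range(1000):
            x0 = 2.0 * torch.randn(256, 1) + 1.0     # toy data distribution
            xt = x0 + sigma * torch.randn_like(x0)   # perturbed samples
            target = -(xt - x0) / sigma**2           # score of q(x_t | x_0)
            loss = ((score_net(xt) - target) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()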
    [D] Pretraining the discriminator of a Least Squares GAN
    (2 min) I am trying to train a GAN to generate human poses in 3D space using the Human3.6M dataset; the output of the GAN is the 3D coordinates of the human joints. I have been experimenting with vanilla GANs, but the output is quite noisy. I am now looking into Least Squares GANs and was wondering whether it is a good idea to pretrain the discriminator of an LSGAN, given that LSGANs address vanishing gradients and loss saturation. submitted by /u/I_am_a_robot_ [link] [comments]
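    For reference, the least-squares discriminator objective itself is simple to pretrain against (a hedged sketch; the pose dimensionality and the noise-corrupted "fakes" below are stand-ins for your data). One caveat: a discriminator pretrained too far ahead of the generator can still slow early generator progress, even without the saturation of the vanilla GAN loss.

        import torch
        import torch.nn as nn

        D = nn.Sequential(nn.Linear(51, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
        opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

        def d_loss(real, fake):
            # LSGAN labels a=0, b=1: 0.5*E[(D(real)-1)^2] + 0.5*E[D(fake)^2]
            return 0.5 * ((D(real) - 1) ** 2).mean() + 0.5 * (D(fake) ** 2).mean()

        real = torch.randn(64, 51)                  # e.g. 17 joints x 3 coords, flattened
        fake = real + 0.3 * torch.randn_like(real)  # stand-in for generator output
        loss = d_loss(real, fake)
        opt_d.zero_grad(); loss.backward(); opt_d.step()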
    [D] Minimum Corpus Size for Word Embedding Extraction
    (1 min) Dear all, I have a smallish ~100MB corpus (of historical text in a non-mainstream language) on which I want to apply word embeddings. Is that enough? Should I only consider "frequent" words, and how frequent? Does it help if I do some preprocessing such as stemming? How do I choose the parameters, especially the embedding dimensionality? Any libraries recommended? Are there language-agnostic, unsupervised ways to evaluate the embeddings? Thanks. submitted by /u/ihatebeinganonymous [link] [comments]
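    For what it's worth, gensim is the usual starting point; a minimal sketch (the parameter values are generic starting points, not recommendations tuned for this corpus):

        from gensim.models import Word2Vec

        sentences = [["a", "tokenized", "sentence"], ["another", "one"]]  # your corpus
        model = Word2Vec(
            sentences,
            vector_size=100,  # 50-100 dims is a common range for small corpora
            window=5,
            min_count=1,      # raise to 5+ on a real corpus to drop rare words
            sg=1,             # skip-gram tends to do better on smaller data
            epochs=10,
        )
        print(model.wv.most_similar("sentence", topn=1))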
    [D] Ideal deep learning library
    (3 min) From a researcher's perspective: What do you miss in libraries like PyTorch or TensorFlow? What could be improved? Some possible examples: the way autodiff works; debugging features; working with axes, einops; something that just feels awkward, inconvenient or incomplete. I would very much appreciate it if you could share your thoughts on this. submitted by /u/u6785 [link] [comments]
    Predicting future labels where the future values of only some features are known (RNNs, time series) [D]
    (3 min) I've done a fair bit of work with RNNs and time series data, but I now realize there may be a fundamental gap in my knowledge. I was reading through this and it raised some questions about the different kinds of time forecasting problems. So, this is my summary of how it all works; please correct me if I'm wrong. Scenario 1: You want to predict the value of some stocks into the future. Let's say you have k stocks and n days of data. You don't have features/labels for the stocks; rather, the input is each stock's current and previous values, and the output is each stock's future values. Your two main options are the 'auto-regressive' approach, where you predict the values one step at a time and feed them back in (see the sketch below), or the 'single shot' approach, where you predict all values a fixed amount of time step…
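    As a concrete illustration of the auto-regressive option (a minimal sketch; `model` is any one-step predictor, here a trivial persistence stand-in):

        import numpy as np

        def autoregressive_forecast(model, history, horizon):
            window, preds = list(history), []
            for _ in range(horizon):
                next_val = model(np.array(window))  # predict one step ahead
                preds.append(next_val)
                window = window[1:] + [next_val]    # feed the prediction back in
            return preds

        # Persistence stand-in: always predict the last observed value.
        print(autoregressive_forecast(lambda w: w[-1], [1.0, 1.2, 1.1], horizon=5))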
    [P] Classification with imbalanced datasets question
    (3 min) I've been working on a medical classification project with an imbalanced tabular dataset. I have 3 classes, with 44, 16, and 14 rows of data respectively. When I train a random forest classifier, I see that most of the time my model predicts only the dominant class for all test instances. How can I get around this? Also, are there any recommendations you can give me for dealing with imbalanced datasets? Thank you. submitted by /u/chazy07 [link] [comments]
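    Two cheap first steps worth trying before anything exotic (a hedged sketch on synthetic data shaped like the 44/16/14 split): class weighting, and scoring with balanced accuracy under stratified cross-validation, so the dominant class cannot hide poor minority recall.

        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import StratifiedKFold, cross_val_score

        # Synthetic stand-in with roughly the same size and class proportions.
        X, y = make_classification(n_samples=74, n_classes=3, n_informative=5,
                                   weights=[0.60, 0.22, 0.18], random_state=0)
        clf = RandomForestClassifier(class_weight="balanced", random_state=0)
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
        print(cross_val_score(clf, X, y, cv=cv, scoring="balanced_accuracy").mean())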
    [P] I implemented Conformer: Convolution-augmented Transformer
    (1 min) I implemented Google AI's "Conformer: Convolution-augmented Transformer for Speech Recognition" paper; it achieves the best of both worlds by combining CNNs and transformers to model both local and global dependencies, and improves the local inductive bias in Transformers. https://github.com/Rishit-dagli/Conformer submitted by /u/Rishit-dagli [link] [comments]
  • Reinforcement Learning

    Simulation environment to real life. Is the brain of RL still flexible enough to learn in real life env?
    (1 min) I am planning to train TD3/DDPG in a simulation environment, and then continue the learning in a real-life environment. I hope to reduce the timesteps required to converge in the real-life environment, as it is costly and time-consuming. I am new to RL and I am curious: would the algorithm still be flexible enough to continue learning? I am slightly afraid that the algorithm will behave as if it finished learning in the simulation environment, and then, in the real-life environment, will not be flexible enough to learn on top of what it has already learned. Is this a trivial concern, and something I should just let the algorithm sort out by itself? submitted by /u/KoreaNuclear [link] [comments]
    "Finding General Equilibria in Many-Agent Economic Simulations Using Deep Reinforcement Learning", Curry et al 2022
    (1 min) submitted by /u/gwern [link] [comments]
    Final reward! Help!
    (1 min) Hellooo, I have a question and I'll be glad if someone could help me. The question is: in episodic tasks, can we work with two rewards, one during the steps of an episode and the other as the final reward at the end of the episode? Thank you! submitted by /u/LeatherCredit7148 [link] [comments]
    Some questions on Deep Reinforcement Learning (DRL)
    (2 min) Hello, I hope you are doing well. I am working on DRL and there are still some unclear points for me: How do we tune the hyperparameters of the network (is there a method to simplify the task)? How do we know that we formulated the best state for the agent? In a collaborative multi-agent system, when one agent selects a task and the other agents are busy, will the reward be 0 for the busy agents, or will they receive the reward of the agent that selected the task? submitted by /u/GuavaAgreeable208 [link] [comments]
    Workshops on AI and RL by Shaastra, IIT Madras
    (1 min) Workshops from Shaastra, IIT Madras on AI and reinforcement learning. Certificates and recordings will be provided on registering on Shaastra's website. submitted by /u/RowEmbarrassed4756 [link] [comments]
    Real life Reinforcement learning
    (1 min) submitted by /u/odinnotdoit [link] [comments]
    Equivalent framework to Gin for PyTorch
    (0 min) Hi, do any of you know if there is an equivalent of this framework (https://github.com/google/gin-config) for PyTorch? Thanks. submitted by /u/No_Possibility_7588 [link] [comments]
    Scalar reward is not enough
    (1 min) Check out this paper, which discusses the idea that a scalar reward is not enough to create AGI. https://arxiv.org/abs/2112.15422 What are your thoughts on this? submitted by /u/Longjumping-Chart-34 [link] [comments]

2022-01-04

  • Featured Blog Posts - Data Science Central

    DSC Weekly Digest 28 December 2021: An Auld Lang Syne (and Cosyne Too)
    (7 min) The last week of the year has traditionally been about reflections and planning, taking stock of the old (or auld), and preparing for the changes of the new. I've added my own thoughts about what 2022 will…
    Why a Data Science Career Is Worth Pursuing
    (4 min) “There were five exabytes of information created between the dawn of civilization through 2003, but that much information is now created every two days.” ~Eric Schmidt (Executive Chairman at Google) The term data science was first…
    Top Social Media Content Moderation Trends that Will Reign Supreme in 2022
    (4 min) The technique of managing desired information from online platforms such as social media networking sites is known as content moderation. It's also known as social moderation, and it's used to control various forms of content that aren't appropriate for a general audience.…
    Could we Live in a Universe with Fewer than Three Dimensions?
    (6 min) No, I am not a flat-earther. Quite the contrary, the ideas suggested here assume advanced scientific knowledge, such as quantum physics. The famous Cambridge-based physicist Stephen Hawking suggested that we may live in a universe with 11 dimensions. What I discuss here is not conflicting with that statement, as we shall see.  It all started…
    Data science and analytics: the future implications
    (4 min) Data science is a domain that combines data-bound analytical techniques and scientific theory to generate insights for business stakeholders. Its shape, elements, and size allow organizations to optimize operations, identify new business opportunities, and improve the performance of departments such as marketing and sales. Simply put,…
  • The Official NVIDIA Blog

    NVIDIA Builds Isaac AMR Platform to Aid $9 Trillion Logistics Industry
    (4 min) Manufacturing and fulfillment centers are profoundly complex. Whenever new earbuds or socks land at your doorstep in hours or a vehicle rolls off an assembly line, a maze of magic happens with AI-driven logistics. Massive facilities like these are constantly in flux. Robots travel miles of aisles to roll up millions of products to assist…
    Gamers, Creators, Drivers Feel GeForce RTX, NVIDIA AI Everywhere
    (6 min) Putting the power of graphics and AI at the fingertips of more users than ever, NVIDIA announced today new laptops and autonomous vehicles using GeForce RTX and NVIDIA AI platforms and expanded reach for GeForce NOW cloud gaming across Samsung TVs and the AT&T network. A virtual address prior to CES showed next-gen games, new…
    NVIDIA Canvas Updated With New AI Model Delivering 4x Resolution and More Materials
    (2 min) As art evolves and artists’ abilities grow, so must their creative tools. NVIDIA Canvas, the free beta app and part of the NVIDIA Studio suite of creative apps and tools, has brought the real-time painting tool GauGAN to anyone with an NVIDIA RTX GPU. Artists use advanced AI to quickly turn simple brushstrokes into realistic…
    Groundbreaking Updates to NVIDIA Studio Power the 3D Virtual Worlds of Tomorrow, Today
    (4 min) We’re at the dawn of the next digital frontier. Creativity is fueling new developments in design, innovation and virtual worlds. For the creators driving this future, we’ve built NVIDIA Studio, a fully accelerated platform with high-performance GPUs as the heartbeat for laptops and desktops. This hardware is paired with exclusive NVIDIA RTX-accelerated software optimizations in…
    NVIDIA Makes Free Version of Omniverse Available to Millions of Individual Creators and Artists Worldwide
    (4 min) Designed to be the foundation that connects virtual worlds, NVIDIA Omniverse is now available to millions of individual NVIDIA Studio creators using GeForce RTX and NVIDIA RTX GPUs. In a special address at CES, NVIDIA also announced new platform developments for Omniverse Machinima and Omniverse Audio2Face, new platform features like Nucleus Cloud and 3D marketplaces,…
    Autonomous Era Arrives at CES 2022 With NVIDIA DRIVE Hyperion and Omniverse Avatar
    (4 min) CES has long been a showcase on what’s coming down the technology pipeline. This year, NVIDIA is showing the radical innovation happening now. During a special virtual address at the show, Ali Kani, vice president and general manager of Automotive at NVIDIA, detailed the capabilities of DRIVE Hyperion and the many ways the industry is…
    GeForce NOW Delivers Legendary GeForce Gaming With More Games on More Networks to More Devices
    (3 min) GeForce NOW is kicking off the new year by bringing more games, more devices and more networks to our cloud gaming ecosystem. The next pair of Electronic Arts games, Battlefield 4 and Battlefield V, is streaming on GeForce NOW. We’re also working closely with a few titans in their respective industries: AT&T and Samsung. AT&T…
  • MIT News - Artificial intelligence

    Meet the 2021-22 Accenture Fellows
    (5 min) The 2021-22 Accenture Fellows are bolstering research and igniting ideas to help transform global business.
  • AWS Machine Learning Blog

    Take advantage of advanced deployment strategies using Amazon SageMaker deployment guardrails
    (11 min) Deployment guardrails in Amazon SageMaker provide a new set of deployment capabilities allowing you to implement advanced deployment strategies that minimize risk when deploying new model versions on SageMaker hosting. Depending on your use case, you can use a variety of deployment strategies to release new model versions. Each of these strategies relies on a […]
    Train graph neural nets for millions of proteins on Amazon SageMaker and Amazon DocumentDB (with MongoDB compatibility)
    (11 min) There are over 180,000 unique proteins with 3D structures determined, and tens of thousands of new structures are resolved every year. This is only a small fraction of the 200 million known proteins with distinctive sequences. Recent deep learning algorithms such as AlphaFold can accurately predict 3D structures of proteins using their sequences, which helps scale the […]
  • Artificial Intelligence

    The Stargate sequence from Stanley Kubrick's 2001: A Space Odyssey, remade using AI, in the style of visionary artist Alex Grey.
    (0 min) submitted by /u/glenniszen [link] [comments]
    Good Object-recognition pre-trained models
    (1 min) I used YOLOv5 for object recognition and it does a fair job, but it also sees my hand as a bird in a lot of situations, and leaves a lot to be desired. Given what similar computer vision models can do, there's got to be a better one out there for detecting objects. PS: I've been using PyTorch for this, so if you find a good model, please also drop a link on how to get it up and running. submitted by /u/acraber [link] [comments]
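    In case it helps, the documented torch.hub entry point makes it easy to swap in larger YOLOv5 checkpoints (yolov5m/l/x), which often cut false positives at some speed cost; a minimal sketch:

        import torch

        model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
        results = model("https://ultralytics.com/images/zidane.jpg")  # path, URL, or array
        results.print()  # class labels with confidence scores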
    I gave it the gameplay section of the Geometry Dash wikipedia article, this is what I got
    (1 min) Geometry Dash features an hourglass timer with three sections in the shape of circles and triangles. The timer starts in the upper left corner of the in-game map and ends when it falls into the lower right corner, allowing players to use up their time. When it falls to the bottom center of the map, a demon car appears and must be navigated. The demon is a curved ramp for the player's icon, and every time the player's icon touches a portion of the demon, its ramp is altered. Also, the demon's top speed increases by 7x every time the player touches the bottom center. If the player touches the edge of the demon, then the demon will fall, taking a 5x amount of coins from the player's meter. At the end of the level, the demon car drops into a pit.[5] submitted by /u/Alternative-Ad-3041 [link] [comments]
    EndlessVN open alpha in March 2022
    (0 min) submitted by /u/roblox22y [link] [comments]
    Is there an AI which is able to edit images to make them look drawn?
    (0 min) With small mistakes, and only some different colors? submitted by /u/xXLisa28Xx [link] [comments]
    BadA$$ Kittays - Generated with Cyborg Love
    (0 min) submitted by /u/NeurogenicArtist [link] [comments]
    [R] A Neural Network Solves, Grades & Generates University-Level Mathematics Problems by Program Synthesis
    (1 min) In the new paper A Neural Network Solves and Generates Mathematics Problems by Program Synthesis: Calculus, Differential Equations, Linear Algebra, and More, a research team from MIT, Columbia University, Harvard University and University of Waterloo proposes a neural network that can solve university-level mathematics problems via program synthesis. Here is a quick read: A Neural Network Solves, Grades & Generates University-Level Mathematics Problems by Program Synthesis. The paper A Neural Network Solves and Generates Mathematics Problems by Program Synthesis: Calculus, Differential Equations, Linear Algebra, and More is on arXiv. submitted by /u/Yuqing7 [link] [comments]
    Top AI Trends of 2022
    (0 min) submitted by /u/Beautiful-Credit-868 [link] [comments]
    I put the word 'death' in a text to image AI and this is what I got...
    (1 min) submitted by /u/smartpug967 [link] [comments]
    Researchers Propose A Novel Parameter Differentiation-Based Method That Can Automatically Determine Which Parameters Should Be Shared And Which Ones Should Be Language-Specific
    (1 min) In recent years, neural machine translation (NMT) has attracted a lot of attention and has had a lot of success. While traditional NMT is capable of translating a single language pair, training a separate model for each language pair is time-consuming, especially given the world’s thousands of languages. As a result, multilingual NMT is designed to handle many language pairs in a single model, lowering the cost of offline training and online deployment significantly. Furthermore, parameter sharing in multilingual neural machine translation promotes positive knowledge transfer between languages and is advantageous for low-resource translation. Despite the advantages of cooperative training with a completely shared model, the MNMT approach has a model capacity problem. The shared parameters are more likely to preserve broad knowledge while ignoring language-specific knowledge. To improve the model capacity, researchers use heuristic design to create extra language-specific components and build a Multilingual neural machine translation (MNMT) model with a mix of shared and language-specific characteristics, such as the language-specific attention, lightweight language adaptor, or language-specific routing layer. Continue Reading Paper: https://arxiv.org/pdf/2112.13619v1.pdf Github: https://github.com/voidmagic/parameter-differentiation submitted by /u/ai-lover [link] [comments]
  • Machine Learning

    [P] ray-skorch - distributed PyTorch on Ray with sklearn API
    (1 min) tl;dr: train PyTorch models on large tabular datasets with a scikit-learn (skorch) API. Hi r/MachineLearning, I'm the principal author of ray-skorch, a library that lets you run distributed PyTorch training on large-scale datasets while providing a familiar, scikit-learn compatible skorch API, integrating well with the rest of the scikit-learn ecosystem. Under the hood, ray-skorch uses Ray Train for distributed PyTorch training and Ray Data for handling and shuffling large datasets. ray-skorch works only with tabular data. Currently, it can use numpy arrays, pandas dataframes and Ray Data Datasets. Install with pip install ray-skorch. You can switch your skorch code to ray-skorch just by changing a few lines:

        import numpy as np
        from sklearn.datasets import make_classification
        from torch import nn
        # pip install pytorch_tabnet
        from pytorch_tabnet.tab_network import TabNet
        from ray_skorch import RayTrainNeuralNet

        X, y = make_classification(1000, 20, n_informative=10, random_state=0)
        X = X.astype(np.float32)
        y = y.astype(np.int64)

        net = RayTrainNeuralNet(
            TabNet,
            num_workers=2,  # the only new mandatory argument
            criterion=nn.CrossEntropyLoss,
            max_epochs=10,
            lr=0.1,
            # TabNet-specific arguments
            module__input_dim=20,
            module__output_dim=2,
            # required for classification loss funcs
            iterator_train__unsqueeze_label_tensor=False,
            iterator_valid__unsqueeze_label_tensor=False,
        )
        net.fit(X, y)
        # predict_proba returns a ray.data.Dataset
        y_proba = net.predict_proba(X).to_pandas()

    More examples, including ones on bigger datasets, can be found here: https://github.com/Yard1/ray-skorch/tree/main/examples The package is experimental, and I'd love to hear your feedback - both on the package itself and on the concept of distributed training on tabular data with simple, familiar APIs. Any comments, suggestions or bug reports are hugely appreciated! submitted by /u/Yard1PL [link] [comments]
    [D] Deep Learning is the future of gaming.
    (1 min) Hey everybody --- I know this isn't hard core AI research but I have been thinking a lot about deep learning and gaming recently and put together a little presentation on how I see things unfolding. Lots of cool research featured in the video. https://www.youtube.com/watch?v=JDL8rZzYVwQ I go over: Photorealistic neural rendering Deepfakes for gaming (https://www.youtube.com/watch?v=RR7u11ANDWE is a better example than the obama one I used) GAN theft auto and dreaming up game engines with neural networks Large language models for building realistic NPCs and storytelling Using OpenAI Codex to automatically program games. It's really clear that deep learning is the most important technology to impact gaming since the advent of 3D graphics. Would love to talk with anybody who is working on stuff in this space. submitted by /u/sabalaba [link] [comments]
    [D] Style Transfer with Noise Vector
    (1 min) Hi everyone, I'm looking for a model which can perform style transfer, but also takes an auxiliary noise vector similar to that for StyleGAN to generate many stylized images for a single input image. Is anyone aware of any model meeting these requirements? My best idea so far is to first embed the image into the StyleGAN latent space with this paper, and then add noise to that vector. submitted by /u/Sebass13 [link] [comments]
    [D] Interpolation, Extrapolation and Linearisation (Prof. Yann LeCun, Dr. Randall Balestriero)
    (1 min) Special Machine Learning Street Talk episode! Yann LeCun thinks that it's specious to say neural network models are interpolating, because in high dimensions everything is extrapolation. Recently, Dr. Randall Balestriero, Jerome Pesenti and Prof. Yann LeCun released their paper "Learning in High Dimension Always Amounts to Extrapolation". This discussion has completely changed how we think about neural networks and their behaviour. In the intro we talk about the spline theory of NNs, interpolation in NNs, and the curse of dimensionality. YT: https://youtu.be/86ib0sfdFtw Pod: https://anchor.fm/machinelearningstreettalk/episodes/061-Interpolation--Extrapolation-and-Linearisation-Prof--Yann-LeCun--Dr--Randall-Balestriero-e1cgdr0 References: Learning in High Dimension Always Amounts to Extrapolation [Randall Balestriero, Jerome Pesenti, Yann LeCun] https://arxiv.org/abs/2110.09485 A Spline Theory of Deep Learning [Balestriero, Baraniuk] https://proceedings.mlr.press/v80/balestriero18b.html Neural Decision Trees [Balestriero] https://arxiv.org/pdf/1702.07360.pdf Interpolation of Sparse High-Dimensional Data [Thomas Lux] https://tchlux.github.io/papers/tchlux-2020-NUMA.pdf submitted by /u/timscarfe [link] [comments]
    [D] Neural Networks using a generic GPU framework
    (2 min) I have a (personal) ML project that uses CNNs, but there are two little problems: 1. not everyone has an NVIDIA GPU at home (myself included, sadly); 2. the CNN needs to be trained every time it is used (it's photo-to-photo style transfer). So, what would be a good framework to implement the CNN for training (targeting desktop only)? I thought about using OpenGL, but I don't know if GLSL shaders would be a good fit. submitted by /u/crimsom_king [link] [comments]
    [D] What are interviews usually like for ML positions?
    (1 min) For context, I'm applying for PhD level positions. Should I expect technical interviews including coding challenges similar to SWE? Any advice on prepping? submitted by /u/_Dark_Forest [link] [comments]
    Deep Learning Interviews: Hundreds of fully solved job interview questions from a wide range of key topics in AI
    (0 min) submitted by /u/pit_station [link] [comments]
    [N] Launching DagsHub 2.0 – Git-integrated data labeling and smart ML discussions
    (2 min) TL;DR – DagsHub is integrated with Label Studio, and you can now open datasets from Git and DVC remotes, label them and commit labels back, without doing any DevOps. You can also comment on labels, bounding boxes, or any file. Check out the example project, or try out the tutorial. Hi r/ML! I'm one of the creators of DagsHub (https://www.dagshub.com). We help ML practitioners create a central repository for their projects, where they can leverage open-source tools to version datasets and models, track experiments, and starting today – label data, and comment on anything. Like GitHub for machine learning (you probably heard that before, but we mean it). Our vision is that anyone could jump into an open-source data science project and contribute code, data, labeling,…
    [D] Why is VAE used instead of AutoEncoder in the World Models paper (https://arxiv.org/pdf/1803.10122.pdf)?
    (2 min) Hi all, I was just reading this paper and was wondering: if we just want a compact version of the original representation, couldn't we use a traditional autoencoder? Is there any specific reason the VAE is used? Thanks! submitted by /u/StageTraditional636 [link] [comments]
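    One way to see the difference in code (a minimal sketch of the VAE-specific pieces, not code from the paper): the reparameterized latent plus the KL term keep the latent space close to a known prior you can sample from, which world-model-style rollouts depend on; a plain autoencoder gives a compact code but no principled way to sample new latents.

        import torch

        mu, logvar = torch.zeros(32), torch.zeros(32)        # encoder outputs (toy)
        z = mu + torch.exp(0.5 * logvar) * torch.randn(32)   # reparameterization trick
        kl = -0.5 * torch.sum(1 + logvar - mu**2 - logvar.exp())  # keeps q(z|x) near N(0, I)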
    [R] Play against an AI to detect fake audio
    (1 min) Hi everybody, I'm a PhD student interested in audio spoofs (voice recordings faked with the help of AI), and have developed an online game: you play against an artificial intelligence and try to distinguish spoofed from real audio recordings. It's fun, and very much supports my research. All participation (i.e. playing the game), comments or suggestions are welcome! https://deepfake-demo.aisec.fraunhofer.de/ submitted by /u/mummni [link] [comments]
    [D] What are the reviewers' scores of the submissions nominated for best paper awards in top ML conferences such as NeurIPS, ICML, AISTATS, etc.?
    (1 min) I submitted a paper to AISTATS 2022 that could be a breakthrough with outstanding contributions. The paper received reviewer scores of 8/7/6, which could only improve to 8/7/7 after the rebuttals. What are the chances that our submission enters the short list for best paper recognition? What are the average reviewer scores of the submissions that get nominated at these top conferences? submitted by /u/Cyrus_the-great [link] [comments]
    [D] Does anyone know which paper this code belongs to?
    (0 min) https://github.com/yuzisheng/trajectory-compress especially the Spatio-Temporal Curvature Streaming submitted by /u/choayue [link] [comments]
    [P] Sieve: We processed ~24 hours of security footage in <10 mins (now semantically searchable per-frame!)
    (6 min) Hey everyone! I’m one of the creators of Sieve, and I’m excited to be sharing it! Sieve is an API that helps you store, process, and automatically search your video data–instantly and efficiently. Just think 10 cameras recording footage at 30 FPS, 24/7. That would be 27 million frames generated in a single day. The videos might be searchable by timestamp, but finding moments of interest is like searching for a needle in a haystack. We built this visual demo (link here) a little while back which we’d love to get feedback on. It’s ~24 hours of security footage that our API processed in <10 mins and has simple querying and export functionality enabled. We see applications in better understanding what data you have, figuring out which data to send to labeling, sampling datasets for training, and building multiple test sets for models by scenario. To try it on your videos: https://github.com/Sieve-Data/automatic-video-processing Visual dashboard walkthrough: https://youtu.be/_uyjp_HGZl4 submitted by /u/happybirthday290 [link] [comments]
    [D] Paper Summary [Rethinking Segmentation from a Sequence-to-Sequence Perspective with Transformers]
    (1 min) Hi, I have just published my latest Medium article. It is a summary of a scientific paper that aims to eliminate the effect of locality, one of the limitations of CNNs. In this attempt, the researchers reformulate the image semantic segmentation problem, apply a proposed transformer, and finally introduce three different decoder architectures. Please read it and give me your feedback. If you find it interesting, you can share it with others who are interested in ML as well. Also, if you find it helpful, you can follow me on Medium to be updated on my forthcoming articles.🙂 https://rezayazdanfar.medium.com/26868efacc52 submitted by /u/rezayazdanfar [link] [comments]
    [D] Which tools can be helpful for annotation of videos for action recognition?
    (1 min) There is a team at my university that works on ergonomics. They want to do action recognition on some videos and approached me for help. I work on images; I don't have experience with videos. I have the dataset, and I want to annotate key points in each frame. Which tools can be helpful for annotating videos? submitted by /u/SAbdusSamad [link] [comments]
  • Reinforcement Learning

    Training with multiple agents.
    (0 min) Does anyone know if any of the open source RL libraries support multi-agent training in which agents can have different action spaces? submitted by /u/YearPersonal5709 [link] [comments]
    Difference between cooperative games and stochastic games
    (1 min) Some say that cooperative games are a subset of stochastic games, but I don't see how. In cooperative games, all the agents have a team reward and have different states, whereas in stochastic games, all the agents receive individual rewards and they share the same state at a particular time step. Can someone help me understand the difference between these two games? submitted by /u/j0ker_70 [link] [comments]
    Understanding "Stochastic Value Gradients (SVG)"
    (1 min) Hi, I have a hard time understanding the paper "Learning Continuous Control Policies by Stochastic Value Gradients" (https://arxiv.org/pdf/1510.09142.pdf). My understanding is that the stochastic value gradient can be computed easily on imagined ("planned") data, where we roll out the policy in the learned world model. However, this seems undesirable due to compounding model errors. So far, so good. The above issue is fixed when we use real environment samples, although the noise variables η and ξ required to calculate the stochastic value gradient are then unknown. The paper then applies Bayes' rule to reformulate the stochastic value gradient into a tractable form. My issue: to compute the stochastic value gradient, the noise variables must be inferred. On this issue, the paper simply states "... infer the missing noise variables, possibly by sampling from p(η,ξ|s, a, s′)". How can we sample these noise variables? How can we infer them? Any help would be greatly appreciated! submitted by /u/Internal-Brush4929 [link] [comments]
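    If I read the setup right (hedged; this is my reading of the reparameterized-Gaussian case, not text from the paper): when the model is written as s' = mu(s, a) + sigma(s, a) * xi with xi ~ N(0, I), the posterior p(xi | s, a, s') collapses to a point, so "inferring" the noise is just inverting the reparameterization.

        import torch

        def infer_noise(mu, sigma, s_next):
            # The unique xi such that mu + sigma * xi == s_next (elementwise).
            return (s_next - mu) / sigma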
    Training Data
    (1 min) Hi, I'm pretty new to this kind of learning task and find it hard to understand some basic concepts. I'm studying the Decision Transformer paper, and for a university project I need to train that model using the MiniWorld environment. While I was looking at the code provided by the authors of MiniWorld, a question came up: how do I get training data? The authors of Decision Transformer use D4RL, which provides offline datasets. How can I create training data in a similar way using my environment? submitted by /u/maverik75 [link] [comments]
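    A minimal sketch of what D4RL-style data collection boils down to (gym-style API; `env` would be your MiniWorld environment and `policy` any behavior policy, e.g. random or scripted):

        import numpy as np

        def collect_episode(env, policy):
            obs, done = env.reset(), False
            traj = {"obs": [], "act": [], "rew": []}
            while not done:
                act = policy(obs)
                next_obs, rew, done, _ = env.step(act)
                traj["obs"].append(obs); traj["act"].append(act); traj["rew"].append(rew)
                obs = next_obs
            return {k: np.array(v) for k, v in traj.items()}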
    Anyone using RL for Trading stocks?
    (2 min) submitted by /u/GarantBM [link] [comments]
    How to make gym a parallel environment?
    (1 min) I'm running the gym environment CartPole-v0, but my GPU usage is low. My idea is to use N copies of the same policy network to get actions for N envs. I tried agymc, but it doesn't satisfy my need, because it can only copy the envs and cannot use N networks. neural.py (github.com) So I wrote the code as above, but I get the error below; how can I fix it?
    Exception in thread Thread-7:
      File "C:\Users\afuler\AppData\Local\Programs\Python\Python39\lib\threading.py", line 973, in _bootstrap_inner
        self.run()
      File "C:\Users\afuler\AppData\Local\Programs\Python\Python39\lib\threading.py", line 910, in run
        self._target(*self._args, **self._kwargs)
      File "d:\workspace\rl\nonnueral\nonnueralrl.py", line 271, in predict
        envList[worker_num].render()
      File "C:\Users\afuler\AppData\Local\Pro…
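    A hedged alternative to hand-rolled threads: gym's built-in vector API steps N env copies in subprocesses and returns batched observations, so a single network (or several) can act on the whole batch in one forward pass; a minimal sketch (the step signature varies across gym versions):

        import gym

        envs = gym.vector.AsyncVectorEnv(
            [lambda: gym.make("CartPole-v0") for _ in range(8)])
        obs = envs.reset()                                # batched: shape (8, 4)
        actions = envs.action_space.sample()              # one action per env
        obs, rewards, dones, infos = envs.step(actions)   # 4-tuple in older gym
        envs.close()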
    PPO convergence when moving to continuous action space
    (1 min) Hello :D, I had a PPO implementation with convolutional neural networks that worked just fine with a discrete action space (the environment is a grid world where an agent has 4 possible actions = directions). I took the same implementation and switched to a continuous action space (2 actions = force over the x and y axes). The model does pretty well in the first 7M steps (the reward converges), but then the reward suddenly drops. Any idea what could be the reason? Could it be the decaying variance I am using for the continuous action selection? submitted by /u/AhmedNizam_ [link] [comments]
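    One common alternative to an externally decayed variance (a hedged sketch, not a diagnosis): make the exploration scale a learnable, clamped parameter, so the policy adjusts its own spread instead of being forced toward determinism on a fixed schedule.

        import torch
        import torch.nn as nn

        class GaussianHead(nn.Module):
            def __init__(self, in_dim, act_dim):
                super().__init__()
                self.mu = nn.Linear(in_dim, act_dim)
                self.log_std = nn.Parameter(torch.zeros(act_dim))  # learned, not scheduled

            def forward(self, h):
                log_std = self.log_std.clamp(-5.0, 2.0)  # avoid collapse / blow-up
                return torch.distributions.Normal(self.mu(h), log_std.exp())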

2022-01-03

  • cs.LG updates on arXiv.org

    CONFAIR: Configurable and Interpretable Algorithmic Fairness. (arXiv:2111.08878v3 [cs.LG] UPDATED)
    (2 min) The rapid growth of data in recent years has led to the development of complex learning algorithms that are often used to make decisions in the real world. While the positive impact of the algorithms has been tremendous, there is a need to mitigate any bias arising from either training samples or implicit assumptions made about the data samples. This need becomes critical when algorithms are used in automated decision making systems that can hugely impact people's lives. Many approaches have been proposed to make learning algorithms fair by detecting and mitigating bias in different stages of optimization. However, due to a lack of a universal definition of fairness, these algorithms optimize for a particular interpretation of fairness which makes them limited for real world use. Moreover, an underlying assumption that is common to all algorithms is the apparent equivalence of achieving fairness and removing bias. In other words, there is no user defined criteria that can be incorporated into the optimization procedure for producing a fair algorithm. Motivated by these shortcomings of existing methods, we propose the CONFAIR procedure that produces a fair algorithm by incorporating user constraints into the optimization procedure. Furthermore, we make the process interpretable by estimating the most predictive features from data. We demonstrate the efficacy of our approach on several real world datasets using different fairness criteria.
    ML-Decoder: Scalable and Versatile Classification Head. (arXiv:2111.12933v2 [cs.CV] UPDATED)
    (2 min) In this paper, we introduce ML-Decoder, a new attention-based classification head. ML-Decoder predicts the existence of class labels via queries, and enables better utilization of spatial data compared to global average pooling. By redesigning the decoder architecture, and using a novel group-decoding scheme, ML-Decoder is highly efficient, and can scale well to thousands of classes. Compared to using a larger backbone, ML-Decoder consistently provides a better speed-accuracy trade-off. ML-Decoder is also versatile - it can be used as a drop-in replacement for various classification heads, and generalize to unseen classes when operated with word queries. Novel query augmentations further improve its generalization ability. Using ML-Decoder, we achieve state-of-the-art results on several classification tasks: on MS-COCO multi-label, we reach 91.4% mAP; on NUS-WIDE zero-shot, we reach 31.1% ZSL mAP; and on ImageNet single-label, we reach with vanilla ResNet50 backbone a new top score of 80.7%, without extra data or distillation. Public code is available at: https://github.com/Alibaba-MIIL/ML_Decoder
    Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification. (arXiv:2103.12656v2 [cs.LG] UPDATED)
    (2 min) Reinforcement learning (RL) algorithms assume that users specify tasks by manually writing down a reward function. However, this process can be laborious and demands considerable technical expertise. Can we devise RL algorithms that instead enable users to specify tasks simply by providing examples of successful outcomes? In this paper, we derive a control algorithm that maximizes the future probability of these successful outcome examples. Prior work has approached similar problems with a two-stage process, first learning a reward function and then optimizing this reward function using another RL algorithm. In contrast, our method directly learns a value function from transitions and successful outcomes, without learning this intermediate reward function. Our method therefore requires fewer hyperparameters to tune and lines of code to debug. We show that our method satisfies a new data-driven Bellman equation, where examples take the place of the typical reward function term. Experiments show that our approach outperforms prior methods that learn explicit reward functions.
    A Survey of Embodied AI: From Simulators to Research Tasks. (arXiv:2103.04918v6 [cs.AI] UPDATED)
    (2 min) There has been an emerging paradigm shift from the era of "internet AI" to "embodied AI", where AI algorithms and agents no longer learn from datasets of images, videos or text curated primarily from the internet. Instead, they learn through interactions with their environments from an egocentric perception similar to humans. Consequently, there has been substantial growth in the demand for embodied AI simulators to support various embodied AI research tasks. This growing interest in embodied AI is beneficial to the greater pursuit of Artificial General Intelligence (AGI), but there has not been a contemporary and comprehensive survey of this field. This paper aims to provide an encyclopedic survey for the field of embodied AI, from its simulators to its research. By evaluating nine current embodied AI simulators with our proposed seven features, this paper aims to understand the simulators in their provision for use in embodied AI research and their limitations. Lastly, this paper surveys the three main research tasks in embodied AI -- visual exploration, visual navigation and embodied question answering (QA), covering the state-of-the-art approaches, evaluation metrics and datasets. Finally, with the new insights revealed through surveying the field, the paper will provide suggestions for simulator-for-task selections and recommendations for the future directions of the field.
    A GAN-Like Approach for Physics-Based Imitation Learning and Interactive Character Control. (arXiv:2105.10066v4 [cs.GR] UPDATED)
    (2 min) We present a simple and intuitive approach for interactive control of physically simulated characters. Our work builds upon generative adversarial networks (GAN) and reinforcement learning, and introduces an imitation learning framework where an ensemble of classifiers and an imitation policy are trained in tandem given pre-processed reference clips. The classifiers are trained to discriminate the reference motion from the motion generated by the imitation policy, while the policy is rewarded for fooling the discriminators. Using our GAN-based approach, multiple motor control policies can be trained separately to imitate different behaviors. At runtime, our system can respond to external control signals provided by the user and interactively switch between different policies. Compared to existing methods, our proposed approach has the following attractive properties: 1) achieves state-of-the-art imitation performance without manually designing and fine-tuning a reward function; 2) directly controls the character without having to track any target reference pose explicitly or implicitly through a phase state; and 3) supports interactive policy switching without requiring any motion generation or motion matching mechanism. We highlight the applicability of our approach in a range of imitation and interactive control tasks, while also demonstrating its ability to withstand external perturbations as well as to recover balance. Overall, our approach generates high-fidelity motion, has low runtime cost, and can be easily integrated into interactive applications and games.
    Analysis of Regularized Learning in Banach Spaces. (arXiv:2109.03159v4 [cs.LG] UPDATED)
    (2 min) This article presents a new way to study the theory of regularized learning for generalized data in Banach spaces including representer theorems and convergence theorems. The generalized data are composed of linear functionals and real scalars as the input and output elements to represent the discrete information of different local models. By the extension of the classical machine learning, the empirical risks are computed by the generalized data and the loss functions. According to the techniques of regularization, the exact solutions are approximated globally by minimizing the regularized empirical risks over the Banach spaces. The existence and convergence of the approximate solutions are guaranteed by the relative compactness of the generalized input data in the predual spaces of the Banach spaces.
    Leveraging in-domain supervision for unsupervised image-to-image translation tasks via multi-stream generators. (arXiv:2112.15091v1 [cs.CV])
    (2 min) Supervision for image-to-image translation (I2I) tasks is hard to come by, but bears significant effect on the resulting quality. In this paper, we observe that for many Unsupervised I2I (UI2I) scenarios, one domain is more familiar than the other, and offers in-domain prior knowledge, such as semantic segmentation. We argue that for complex scenes, figuring out the semantic structure of the domain is hard, especially with no supervision, but is an important part of a successful I2I operation. We hence introduce two techniques to incorporate this invaluable in-domain prior knowledge for the benefit of translation quality: through a novel Multi-Stream generator architecture, and through a semantic segmentation-based regularization loss term. In essence, we propose splitting the input data according to semantic masks, explicitly guiding the network to different behavior for the different regions of the image. In addition, we propose training a semantic segmentation network along with the translation task, and to leverage this output as a loss term that improves robustness. We validate our approach on urban data, demonstrating superior quality in the challenging UI2I tasks of converting day images to night ones. In addition, we also demonstrate how reinforcing the target dataset with our augmented images improves the training of downstream tasks such as the classical detection one.
    When are Iterative Gaussian Processes Reliably Accurate?. (arXiv:2112.15246v1 [cs.LG])
    (2 min) While recent work on conjugate gradient methods and Lanczos decompositions have achieved scalable Gaussian process inference with highly accurate point predictions, in several implementations these iterative methods appear to struggle with numerical instabilities in learning kernel hyperparameters, and poor test likelihoods. By investigating CG tolerance, preconditioner rank, and Lanczos decomposition rank, we provide a particularly simple prescription to correct these issues: we recommend that one should use a small CG tolerance ($\epsilon \leq 0.01$) and a large root decomposition size ($r \geq 5000$). Moreover, we show that L-BFGS-B is a compelling optimizer for Iterative GPs, achieving convergence with fewer gradient updates.
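    As a hedged illustration of how the prescription maps onto one common implementation (GPyTorch; my sketch, not code from the paper):

        import torch
        import gpytorch

        train_x = torch.linspace(0, 1, 100)
        train_y = torch.sin(6.28 * train_x) + 0.1 * torch.randn(100)

        class GPModel(gpytorch.models.ExactGP):
            def __init__(self, x, y, likelihood):
                super().__init__(x, y, likelihood)
                self.mean_module = gpytorch.means.ConstantMean()
                self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

            def forward(self, x):
                return gpytorch.distributions.MultivariateNormal(
                    self.mean_module(x), self.covar_module(x))

        likelihood = gpytorch.likelihoods.GaussianLikelihood()
        model = GPModel(train_x, train_y, likelihood)
        mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
        model.train(); likelihood.train()

        # The paper's prescription: CG tolerance <= 0.01, root decomposition >= 5000.
        with gpytorch.settings.cg_tolerance(0.01), \
             gpytorch.settings.max_root_decomposition_size(5000):
            loss = -mll(model(train_x), train_y)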
    Processing Images from Multiple IACTs in the TAIGA Experiment with Convolutional Neural Networks. (arXiv:2112.15382v1 [astro-ph.IM])
    (2 min) Extensive air showers created by high-energy particles interacting with the Earth's atmosphere can be detected using imaging atmospheric Cherenkov telescopes (IACTs). The IACT images can be analyzed to distinguish between the events caused by gamma rays and by hadrons and to infer the parameters of the event such as the energy of the primary particle. We use convolutional neural networks (CNNs) to analyze Monte Carlo-simulated images from the telescopes of the TAIGA experiment. The analysis includes selection of the images corresponding to the showers caused by gamma rays and estimating the energy of the gamma rays. We compare the performance of CNNs using images from a single telescope with that of CNNs using images from two telescopes as inputs.
    A Survey on Deep learning based Document Image Enhancement. (arXiv:2112.02719v3 [cs.CV] UPDATED)
    (2 min) Digitized documents such as scientific articles, tax forms, invoices, contract papers, and historic texts are widely used nowadays. These document images can be degraded or damaged due to various reasons including poor lighting conditions, shadow, distortions like noise and blur, aging, ink stains, bleed-through, watermarks, and stamps. Document image enhancement plays a crucial role as a pre-processing step in many automated document analysis and recognition tasks such as character recognition. With recent advances in deep learning, many methods have been proposed to enhance the quality of these document images. In this paper, we review deep learning-based methods, datasets, and metrics for six main document image enhancement tasks, including binarization, deblurring, denoising, defading, watermark removal, and shadow removal. We summarize the recent works for each task and discuss their features, challenges, and limitations. We introduce multiple document image enhancement tasks that have received little to no attention, including over- and under-exposure correction, super resolution, and bleed-through removal. We identify several promising research directions and opportunities for future research.
    A toolkit for data-driven discovery of governing equations in high-noise regimes. (arXiv:2111.04870v2 [cs.LG] UPDATED)
    (2 min) We consider the data-driven discovery of governing equations from time-series data in the limit of high noise. The algorithms developed describe an extensive toolkit of methods for circumventing the deleterious effects of noise in the context of the sparse identification of nonlinear dynamics (SINDy) framework. We offer two primary contributions, both focused on noisy data acquired from a system x' = f(x). First, we propose, for use in high-noise settings, an extensive toolkit of critically enabling extensions for the SINDy regression method, to progressively cull functionals from an over-complete library and yield a set of sparse equations that regress to the derivative x'. These innovations can extract sparse governing equations and coefficients from high-noise time-series data (e.g. 300% added noise). For example, it discovers the correct sparse libraries in the Lorenz system, with median coefficient estimate errors equal to 1% - 3% (for 50% noise), 6% - 8% (for 100% noise), and 23% - 25% (for 300% noise). The enabling modules in the toolkit are combined into a single method, but the individual modules can be tactically applied in other equation discovery methods (SINDy or not) to improve results on high-noise data. Second, we propose a technique, applicable to any model discovery method based on x' = f(x), to assess the accuracy of a discovered model in the context of non-unique solutions due to noisy data. Currently, this non-uniqueness can obscure a discovered model's accuracy and thus a discovery method's effectiveness. We describe a technique that uses linear dependencies among functionals to transform a discovered model into an equivalent form that is closest to the true model, enabling more accurate assessment of a discovered model's accuracy.
    Inferring perceptual decision making parameters from behavior in production and reproduction tasks. (arXiv:2112.15521v1 [cs.LG])
    (2 min) Bayesian models of behavior have provided computational level explanations in a range of psychophysical tasks. One fundamental experimental paradigm is the production or reproduction task, in which subjects are instructed to generate an action that either reproduces a previously sensed stimulus magnitude or achieves a target response. This type of task therefore distinguishes itself from other psychophysical tasks in that the responses are on a continuum and effort plays an important role with increasing response magnitude. Based on Bayesian decision theory we present an inference method to recover perceptual uncertainty, response variability, and the cost function underlying human responses. Crucially, the cost function is parameterized such that effort is explicitly included. We present a hybrid inference method employing MCMC sampling utilizing appropriate proposal distributions and an inner loop utilizing amortized inference with a neural network that approximates the mode of the optimal response distribution. We show how this model can be utilized to avoid unidentifiability of experimental designs and that parameters can be recovered through validation on synthetic and application to experimental data. Our approach will enable behavioral scientists to perform Bayesian inference of decision making parameters in production and reproduction tasks.
    Embedding Information onto a Dynamical System. (arXiv:2105.10766v3 [math.DS] UPDATED)
    (2 min) The celebrated Takens' embedding theorem concerns embedding an attractor of a dynamical system in a Euclidean space of appropriate dimension through a generic delay-observation map. The embedding also establishes a topological conjugacy. In this paper, we show how an arbitrary sequence can be mapped into another space as an attractive solution of a nonautonomous dynamical system. Such mapping also entails a topological conjugacy and an embedding between the sequence and the attractive solution spaces. This result is not a generalization of Takens embedding theorem but helps us understand what exactly is required by discrete-time state space models widely used in applications to embed an external stimulus onto its solution space. Our results settle another basic problem concerning the perturbation of an autonomous dynamical system. We describe what exactly happens to the dynamics when exogenous noise perturbs continuously a local irreducible attracting set (such as a stable fixed point) of a discrete-time autonomous dynamical system.
    Notes on the H-measure of classifier performance. (arXiv:2106.11888v2 [cs.LG] UPDATED)
    (2 min) The H-measure is a classifier performance measure which takes into account the context of application without requiring a rigid value of relative misclassification costs to be set. Since its introduction in 2009 it has become widely adopted. This paper answers various queries which users have raised since its introduction, including questions about its interpretation, the choice of a weighting function, whether it is strictly proper, and its coherence, and relates the measure to other work.
    Frame invariance and scalability of neural operators for partial differential equations. (arXiv:2112.14769v1 [cs.LG])
    (2 min) Partial differential equations (PDEs) play a dominant role in the mathematical modeling of many complex dynamical processes. Solving these PDEs often requires prohibitively high computational costs, especially when multiple evaluations must be made for different parameters or conditions. After training, neural operators can provide solutions to PDEs significantly faster than traditional PDE solvers. In this work, invariance properties and computational complexity of two neural operators are examined for a transport PDE of a scalar quantity. The neural operator based on a graph kernel network (GKN) operates on graph-structured data to incorporate nonlocal dependencies. Here we propose a modified formulation of GKN to achieve frame invariance. The vector cloud neural network (VCNN) is an alternative neural operator with embedded frame invariance which operates on point cloud data. The GKN-based neural operator demonstrates slightly better predictive performance compared to VCNN. However, GKN requires an excessively high computational cost that increases quadratically with the increasing number of discretized objects, as compared to a linear increase for VCNN.
    Semi-Decentralized Federated Edge Learning for Fast Convergence on Non-IID Data. (arXiv:2104.12678v5 [cs.NI] UPDATED)
    (0 min) Federated edge learning (FEEL) has emerged as an effective approach to reduce the large communication latency in Cloud-based machine learning solutions, while preserving data privacy. Unfortunately, the learning performance of FEEL may be compromised due to limited training data in a single edge cluster. In this paper, we investigate a novel framework of FEEL, namely semi-decentralized federated edge learning (SD-FEEL). By allowing model aggregation across different edge clusters, SD-FEEL enjoys the benefit of FEEL in reducing the training latency, while improving the learning performance by accessing richer training data from multiple edge clusters. A training algorithm for SD-FEEL with three main procedures in each round is presented, including local model updates, intra-cluster and inter-cluster model aggregations, which is proved to converge on non-independent and identically distributed (non-IID) data. We also characterize the interplay between the network topology of the edge servers and the communication overhead of inter-cluster model aggregation on the training performance. Experiment results corroborate our analysis and demonstrate the effectiveness of SD-FEEL in achieving faster convergence than traditional federated learning architectures. Besides, guidelines on choosing critical hyper-parameters of the training algorithm are also provided.
    A Unified Analysis Method for Online Optimization in Normed Vector Space. (arXiv:2112.12134v2 [cs.LG] UPDATED)
    (2 min) We present a unified analysis method that relies on the generalized cosine rule and $\phi$-convex for online optimization in normed vector space using dynamic regret as the performance metric. In combing the update rules, we start with strategy $S$ (a two-parameter variant strategy covering Optimistic-FTRL with surrogate linearized losses), and obtain $S$-I (type-I relaxation variant form of $S$) and $S$-II (type-II relaxation variant form of $S$, which is Optimistic-MD) by relaxation. Regret bounds for $S$-I and $S$-II are the tightest possible. As instantiations, regret bounds of normalized exponentiated subgradient and greedy/lazy projection are better than the currently known optimal results. By replacing losses of online game with monotone operators, and extending the definition of regret, namely regret$^n$, we extend online convex optimization to online monotone optimization, which expands the application scope of $S$-I and $S$-II.
    Sufficient Statistic Memory AMP. (arXiv:2112.15327v1 [cs.IT])
    (2 min) Approximate message passing (AMP) is a promising technique for unknown signal reconstruction of certain high-dimensional linear systems with non-Gaussian signaling. A distinguished feature of the AMP-type algorithms is that their dynamics can be rigorously described by state evolution. However, state evolution does not necessarily guarantee the convergence of iterative algorithms. To solve the convergence problem of AMP-type algorithms in principle, this paper proposes a memory AMP (MAMP) under a sufficient statistic condition, named sufficient statistic MAMP (SS-MAMP). We show that the covariance matrices of SS-MAMP are L-banded and convergent. Given an arbitrary MAMP, we can construct an SS-MAMP by damping, which not only ensures the convergence of MAMP but also preserves the orthogonality of MAMP, i.e., its dynamics can be rigorously described by state evolution. As a byproduct, we prove that the Bayes-optimal orthogonal/vector AMP (BO-OAMP/VAMP) is an SS-MAMP. As a result, we reveal two interesting properties of BO-OAMP/VAMP for large systems: 1) the covariance matrices are L-banded and are convergent in BO-OAMP/VAMP, and 2) damping and memory are useless (i.e., do not bring performance improvement) in BO-OAMP/VAMP. As an example, we construct a sufficient statistic Bayes-optimal MAMP (BO-MAMP), which is Bayes optimal if its state evolution has a unique fixed point and its MSE is not worse than the original BO-MAMP. Finally, simulations are provided to verify the validity and accuracy of the theoretical results.
    Entropy Regularized Optimal Transport Independence Criterion. (arXiv:2112.15265v1 [stat.ML])
    (0 min) Optimal transport (OT) and its entropy-regularized offspring have recently gained a lot of attention in both the machine learning and AI communities. In particular, optimal transport has been used to develop metrics between probability distributions. We introduce in this paper an independence criterion based on entropy-regularized optimal transport. Our criterion can be used to test for independence between two samples. We establish non-asymptotic bounds for our test statistic and study its statistical behavior under both the null and alternative hypotheses. Our theoretical results involve tools from U-process theory and optimal transport theory. We present experimental results on existing benchmarks, illustrating the usefulness of the proposed criterion.
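    For a rough feel of the idea, one can compare an entropy-regularized OT (Sinkhorn) cost computed between paired samples and permuted samples that simulate independence; the sketch below is a loose numpy illustration of this intuition, not the paper's test statistic or its theoretical calibration.

        import numpy as np

        def sinkhorn_cost(C, eps=0.05, iters=200):
            # Entropic OT cost between uniform marginals for cost matrix C.
            n, m = C.shape
            a, b = np.ones(n) / n, np.ones(m) / m
            K = np.exp(-C / eps)
            u, v = np.ones(n), np.ones(m)
            for _ in range(iters):   # Sinkhorn fixed-point iterations
                u = a / (K @ v)
                v = b / (K.T @ u)
            P = u[:, None] * K * v[None, :]   # transport plan
            return float(np.sum(P * C))

        rng = np.random.default_rng(0)
        x = rng.normal(size=200)
        y = x + 0.1 * rng.normal(size=200)                  # strongly dependent pair
        joint = np.stack([x, y], axis=1)
        indep = np.stack([x, rng.permutation(y)], axis=1)   # simulated independence
        C = np.sum((joint[:, None, :] - indep[None, :, :]) ** 2, axis=-1)
        C = C / C.max()   # normalize costs for numerical stability
        # A permutation baseline would calibrate this value under the null.
        print(sinkhorn_cost(C))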
    Non Asymptotic Bounds for Optimization via Online Multiplicative Stochastic Gradient Descent. (arXiv:2112.07110v4 [stat.ML] UPDATED)
    (2 min) The gradient noise of Stochastic Gradient Descent (SGD) is considered to play a key role in its properties (e.g., escaping low-potential points and regularization). Past research has indicated that the covariance of the SGD error induced by minibatching plays a critical role in determining its regularization and escape from low-potential points. However, how the distribution of this error influences the behavior of the algorithm has not been explored much. Motivated by some new research in this area, we prove universality results showing that noise classes with the same mean and covariance structure as minibatch SGD have similar properties. We mainly consider the Multiplicative Stochastic Gradient Descent (M-SGD) algorithm introduced by Wu et al., which admits a much more general noise class than minibatch SGD. We establish non-asymptotic bounds for the M-SGD algorithm, mainly with respect to the stochastic differential equation corresponding to minibatch SGD. We also show that the M-SGD error is approximately a scaled Gaussian distribution with mean $0$ at any fixed point of the M-SGD algorithm. Finally, we establish bounds for the convergence of the M-SGD algorithm in the strongly convex regime.
    Financial Vision Based Differential Privacy Applications. (arXiv:2112.14075v2 [cs.LG] UPDATED)
    (2 min) The importance of data privacy in deep learning has gained significant attention in recent years. Data breaches are likely when deep learning is applied to cryptocurrency markets, which lack the supervision of financial regulatory agencies; however, to the best of our knowledge, there is little related research in the financial area. We apply two representative privacy-preserving deep learning frameworks proposed by Google to financial trading data. We design the experiments with several different parameters suggested by the original studies. In addition, we use the privacy levels adopted by Google and Apple as reference points to interpret the results more reasonably. The results show that DP-SGD performs better than the PATE framework on financial trading data: the privacy-accuracy tradeoff of DP-SGD is small, and its degree of privacy is in line with real-world deployments. We can therefore obtain a strong privacy guarantee with good precision and avoid potential financial loss.
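    For reference, the core DP-SGD update evaluated here is per-example gradient clipping followed by calibrated Gaussian noise; a minimal numpy sketch (illustrative, not the authors' code or any production library):

        import numpy as np

        def dp_sgd_step(w, per_example_grads, lr=0.1, clip_norm=1.0, noise_mult=1.1):
            # 1) Clip each example's gradient to bound its sensitivity.
            clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
                       for g in per_example_grads]
            # 2) Average, then add Gaussian noise scaled to the clipping norm.
            avg = np.mean(clipped, axis=0)
            noise = np.random.normal(0.0, noise_mult * clip_norm / len(clipped),
                                     size=w.shape)
            return w - lr * (avg + noise)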
    Score-Based Generative Modeling with Critically-Damped Langevin Diffusion. (arXiv:2112.07068v2 [stat.ML] UPDATED)
    (2 min) Score-based generative models (SGMs) have demonstrated remarkable synthesis quality. SGMs rely on a diffusion process that gradually perturbs the data towards a tractable distribution, while the generative model learns to denoise. The complexity of this denoising task is, apart from the data distribution itself, uniquely determined by the diffusion process. We argue that current SGMs employ overly simplistic diffusions, leading to unnecessarily complex denoising processes, which limit generative modeling performance. Based on connections to statistical mechanics, we propose a novel critically-damped Langevin diffusion (CLD) and show that CLD-based SGMs achieve superior performance. CLD can be interpreted as running a joint diffusion in an extended space, where the auxiliary variables can be considered "velocities" that are coupled to the data variables as in Hamiltonian dynamics. We derive a novel score matching objective for CLD and show that the model only needs to learn the score function of the conditional distribution of the velocity given data, an easier task than learning scores of the data directly. We also derive a new sampling scheme for efficient synthesis from CLD-based diffusion models. We find that CLD outperforms previous SGMs in synthesis quality for similar network architectures and sampling compute budgets. We show that our novel sampler for CLD significantly outperforms solvers such as Euler-Maruyama. Our framework provides new insights into score-based denoising diffusion models and can be readily used for high-resolution image synthesis. Project page and code: https://nv-tlabs.github.io/CLD-SGM.
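    To illustrate the velocity-augmented idea, below is an Euler-Maruyama simulation of a generic critically-damped Langevin diffusion toward a standard normal target, with unit mass and friction set to the classical critical value of 2; these are textbook constants, not necessarily the paper's parameterization.

        import numpy as np

        def cld_forward(x0, steps=1000, dt=1e-3, gamma=2.0):
            # Underdamped Langevin with quadratic potential U(x) = x^2 / 2;
            # gamma = 2 is the critically-damped choice for unit mass and stiffness.
            x, v = x0.copy(), np.zeros_like(x0)
            for _ in range(steps):
                x = x + v * dt                     # Hamiltonian coupling: data moves with velocity
                v = (v - x * dt - gamma * v * dt   # force and friction terms
                     + np.sqrt(2.0 * gamma * dt) * np.random.randn(*v.shape))  # thermal noise
            return x, v

        x_T, v_T = cld_forward(np.random.rand(1000))  # data diffuses toward N(0, 1) jointly with v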
    Improved Algorithm for the Network Alignment Problem with Application to Binary Diffing. (arXiv:2112.15336v1 [cs.LG])
    (0 min) In this paper, we present a novel algorithm to address the Network Alignment problem. It is inspired by a previous message-passing framework of Bayati et al. [2] and includes several modifications designed to significantly speed up the message updates and to enforce their convergence. Experiments show that our proposed model outperforms other state-of-the-art solvers. Finally, we propose an application of our method to the Binary Diffing problem. We show that our solution provides better assignments than reference differs (binary diffing tools) on almost all submitted instances, and we outline the importance of leveraging the graphical structure of binary programs.
    Importance of Empirical Sample Complexity Analysis for Offline Reinforcement Learning. (arXiv:2112.15578v1 [cs.LG])
    (2 min) We hypothesize that empirically studying the sample complexity of offline reinforcement learning (RL) is crucial for its practical application in the real world. Several recent works have demonstrated the ability to learn policies directly from offline data. In this work, we ask how performance depends on the number of samples available when learning from offline data. Our objective is to emphasize that studying sample complexity for offline RL is important, and that it is an indicator of the usefulness of existing offline algorithms. We propose an evaluation approach for sample complexity analysis of offline RL.
    QueryNet: Attack by Multi-Identity Surrogates. (arXiv:2105.15010v3 [cs.LG] UPDATED)
    (2 min) Deep Neural Networks (DNNs) are acknowledged as vulnerable to adversarial attacks, while the existing black-box attacks require extensive queries on the victim DNN to achieve high success rates. For query efficiency, surrogate models of the victim are used to generate transferable Adversarial Examples (AEs) because of their Gradient Similarity (GS), i.e., the surrogates' attack gradients are similar to the victim's. However, exploiting their similarity on outputs, namely the Prediction Similarity (PS), to filter out inefficient queries by surrogates without querying the victim is generally neglected. To jointly utilize and also optimize surrogates' GS and PS, we develop QueryNet, a unified attack framework that can significantly reduce queries. QueryNet creatively attacks by multi-identity surrogates, i.e., it crafts several AEs for one sample using different surrogates, and also uses the surrogates to decide on the most promising AE for the query. After that, the victim's query feedback is accumulated to optimize not only the surrogates' parameters but also their architectures, enhancing both the GS and the PS. Although QueryNet has no access to pre-trained surrogates' priors, it reduces queries on average by about an order of magnitude compared to alternatives within an acceptable time, according to our comprehensive experiments: 11 victims (including two commercial models) on MNIST/CIFAR10/ImageNet, allowing only 8-bit image queries, and no access to the victim's training data. The code is available at https://github.com/AllenChen1998/QueryNet.
    Geometric Deep Learning on Molecular Representations. (arXiv:2107.12375v4 [physics.chem-ph] UPDATED)
    (2 min) Geometric deep learning (GDL), which is based on neural network architectures that incorporate and process symmetry information, has emerged as a recent paradigm in artificial intelligence. GDL bears particular promise in molecular modeling applications, in which various molecular representations with different symmetry properties and levels of abstraction exist. This review provides a structured and harmonized overview of molecular GDL, highlighting its applications in drug discovery, chemical synthesis prediction, and quantum chemistry. Emphasis is placed on the relevance of the learned molecular features and their complementarity to well-established molecular descriptors. This review provides an overview of current challenges and opportunities, and presents a forecast of the future of GDL for molecular sciences.
    Beta-VAE Reproducibility: Challenges and Extensions. (arXiv:2112.14278v2 [cs.LG] UPDATED)
    (2 min) $\beta$-VAE is a follow-up technique to variational autoencoders that proposes a special weighting of the KL divergence term in the VAE loss to obtain disentangled representations. Unsupervised learning is known to be brittle even on toy datasets, and a meaningful, mathematically precise definition of disentanglement remains difficult to find. Here we investigate the original $\beta$-VAE paper and add evidence to previously obtained results indicating its lack of reproducibility. We also expand the experimentation with the models and include additional, more complex datasets in the analysis. We further implement an FID scoring metric for the $\beta$-VAE model and conduct a qualitative analysis of the results obtained. We end with a brief discussion of possible future investigations that could add more robustness to the claims.
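    As background for the reproduction study, the $\beta$-VAE objective simply reweights the KL term of the standard VAE loss by a factor $\beta$; a minimal numpy sketch of the objective under Gaussian assumptions (illustrative, not the paper's code):

        import numpy as np

        def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
            # Reconstruction term (Gaussian likelihood up to constants).
            recon = np.sum((x - x_recon) ** 2)
            # Analytic KL from N(mu, exp(logvar)) to the unit Gaussian prior.
            kl = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar))
            # beta = 1 recovers the standard VAE; beta > 1 pressures disentanglement.
            return recon + beta * kl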
    State Selection Algorithms and Their Impact on The Performance of Stateful Network Protocol Fuzzing. (arXiv:2112.15498v1 [cs.SE])
    (2 min) The statefulness of network protocol implementations poses a unique challenge for testing and verification techniques, including fuzzing. Stateful fuzzers tackle this challenge by leveraging state models to partition the state space and assist the test generation process. Since not all states are equally important and fuzzing campaigns have time limits, fuzzers need effective state selection algorithms to prioritize progressive states over others. Several state selection algorithms have been proposed, but they were implemented and evaluated separately on different platforms, making it hard to reach conclusive findings. In this work, we evaluate an extensive set of state selection algorithms on the same fuzzing platform, AFLNet, a state-of-the-art fuzzer for network servers. The algorithm set includes the existing ones supported by AFLNet and our novel and principled algorithm called AFLNetLegion. The experimental results on the ProFuzzBench benchmark show that (i) the existing state selection algorithms of AFLNet achieve very similar code coverage, (ii) AFLNetLegion clearly outperforms these algorithms in selected case studies, but (iii) the overall improvement appears insignificant. These are unexpected yet interesting findings. We identify problems and share insights that could open opportunities for future research on this topic.
    Transfer learning of phase transitions in percolation and directed percolation. (arXiv:2112.15516v1 [cond-mat.stat-mech])
    (2 min) The latest advances in statistical physics have shown remarkable machine learning performance in identifying phase transitions. In this paper, we apply a domain adversarial neural network (DANN) based on transfer learning to study an equilibrium and a non-equilibrium phase transition model, namely the percolation model and the directed percolation (DP) model, respectively. With the DANN, only a small fraction of input configurations (2d images), chosen automatically, needs to be labeled in order to capture the critical point. To learn the DP model, the method is refined with an iterative procedure for determining the critical point, which is a prerequisite for the data collapse used to calculate the critical exponent $\nu_{\perp}$. We then apply the DANN to two-dimensional site percolation with configurations filtered to include only the largest cluster, which may contain the information related to the order parameter. The DANN learning of both models yields reliable results comparable to those from Monte Carlo simulations. Our study also shows that the DANN can achieve quite high accuracy at a much lower cost than supervised learning.
    Revisiting Experience Replay: Continual Learning by Adaptively Tuning Task-wise Relationship. (arXiv:2112.15402v1 [cs.LG])
    (2 min) Continual learning requires models to learn new tasks while maintaining previously learned knowledge. Various algorithms have been proposed to address this challenge. To date, rehearsal-based methods, such as experience replay, have achieved state-of-the-art performance. These approaches save a small part of the data of past tasks in a memory buffer to prevent models from forgetting previously learned knowledge. However, most of them treat every new task equally, i.e., they fix the hyperparameters of the framework while learning different new tasks. Such a setting ignores the relationship/similarity between past and new tasks. For example, previous knowledge/features learned from dogs are more beneficial for identifying cats (a new task) than those learned from buses. In this regard, we propose a meta-learning algorithm based on bi-level optimization to adaptively tune the relationship between the knowledge extracted from past and new tasks. The model can therefore find an appropriate gradient direction during continual learning and avoid serious overfitting on the memory buffer. Extensive experiments are conducted on three publicly available datasets (i.e., CIFAR-10, CIFAR-100, and Tiny ImageNet). The experimental results demonstrate that the proposed method consistently improves the performance of all baselines.
    Machine Learning Trivializing Maps: A First Step Towards Understanding How Flow-Based Samplers Scale Up. (arXiv:2112.15532v1 [hep-lat])
    (0 min) A trivializing map is a field transformation whose Jacobian determinant exactly cancels the interaction terms in the action, providing a representation of the theory in terms of a deterministic transformation of a distribution from which sampling is trivial. Recently, a proof-of-principle study by Albergo, Kanwar and Shanahan [arXiv:1904.12072] demonstrated that approximations of trivializing maps can be 'machine-learned' by a class of invertible, differentiable neural models called normalizing flows. By ensuring that the Jacobian determinant can be computed efficiently, asymptotically exact sampling from the theory of interest can be performed by drawing samples from a simple distribution and passing them through the network. From a theoretical perspective, this approach has the potential to become more efficient than traditional Markov Chain Monte Carlo sampling techniques, where autocorrelations severely diminish the sampling efficiency as one approaches the continuum limit. A major caveat is that it is not yet understood how the size of models and the cost of training them are expected to scale. As a first step, we have conducted an exploratory scaling study using two-dimensional $\phi^4$ theory with up to $20^2$ lattice sites. Although the scope of our study is limited to a particular model architecture and training algorithm, initial results paint an interesting picture in which training costs grow very quickly indeed. We describe a candidate explanation for the poor scaling and outline our intentions to clarify the situation in future work.
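    The key property the study relies on, an invertible transformation with a cheap Jacobian determinant, is easiest to see in a RealNVP-style affine coupling layer; a minimal numpy sketch of one generic layer of this kind (not the specific architecture used in the study):

        import numpy as np

        def coupling_forward(x, scale_net, shift_net):
            # Transform the second half of x conditioned on the first half.
            x1, x2 = np.split(x, 2)
            s, t = scale_net(x1), shift_net(x1)
            y2 = x2 * np.exp(s) + t
            # Triangular Jacobian: the log-determinant is just the sum of log-scales.
            return np.concatenate([x1, y2]), np.sum(s)

        # Toy conditioner networks; in practice these are trained neural nets.
        rng = np.random.default_rng(0)
        W1, W2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
        y, logdet = coupling_forward(rng.normal(size=16),
                                     scale_net=lambda h: np.tanh(W1 @ h),
                                     shift_net=lambda h: W2 @ h)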
    Disjoint Contrastive Regression Learning for Multi-Sourced Annotations. (arXiv:2112.15411v1 [cs.LG])
    (2 min) Large-scale datasets are important for the development of deep learning models. Such datasets usually require a heavy annotation workload, which is extremely time-consuming and expensive. To accelerate the annotation procedure, multiple annotators may be employed to label different subsets of the data. However, the inconsistency and bias among different annotators are harmful to model training, especially for qualitative and subjective tasks. To address this challenge, we propose a novel contrastive regression framework for the disjoint annotations problem, where each sample is labeled by only one annotator and multiple annotators work on disjoint subsets of the data. To account for both the intra-annotator consistency and the inter-annotator inconsistency, two strategies are employed. First, a contrastive-based loss is applied to learn the relative ranking among different samples of the same annotator, under the assumption that the ranking of samples from the same annotator is unanimous. Second, we apply a gradient reversal layer to learn robust representations that are invariant to different annotators. Experiments on a facial expression prediction task, as well as an image quality assessment task, verify the effectiveness of our proposed framework.
    DeePN$^2$: A deep learning-based non-Newtonian hydrodynamic model. (arXiv:2112.14798v1 [physics.comp-ph])
    (2 min) A long-standing problem in the modeling of non-Newtonian hydrodynamics is the availability of reliable and interpretable hydrodynamic models that faithfully encode the underlying micro-scale polymer dynamics. The main complications arise from the long polymer relaxation times, complex molecular structures, and heterogeneous interactions. DeePN$^2$, a deep learning-based non-Newtonian hydrodynamic model, has been proposed and has shown some success in systematically passing micro-scale structural mechanics information to the macro-scale hydrodynamics for suspensions with simple polymer conformation and bond potential. The model retains a multi-scale nature by mapping the polymer configurations into a set of symmetry-preserving macro-scale features. The extended constitutive laws for these macro-scale features can be directly learned from the kinetics of their micro-scale counterparts. In this paper, we carry out a further study of DeePN$^2$ using more complex micro-structural models. We show that DeePN$^2$ can faithfully capture the broadly overlooked viscoelastic differences arising from the specific molecular structural mechanics, without human intervention.
    Measuring and Sampling: A Metric-guided Subgraph Learning Framework for Graph Neural Network. (arXiv:2112.15015v1 [cs.LG])
    (2 min) Graph neural networks (GNNs) have shown convincing performance in learning powerful node representations that preserve both node attributes and graph structural information. However, many GNNs encounter problems in effectiveness and efficiency when designed with a deeper network structure or when handling large-sized graphs. Several sampling algorithms have been proposed to improve and accelerate the training of GNNs, yet they ignore the question of where the GNN performance gain comes from. Measuring the information within graph data can help sampling algorithms keep high-value information while removing redundant information and even noise. In this paper, we propose a Metric-Guided (MeGuide) subgraph learning framework for GNNs. MeGuide employs two novel metrics, Feature Smoothness and Connection Failure Distance, to guide the subgraph sampling and mini-batch based training. Feature Smoothness is designed for analyzing the features of nodes in order to retain the most valuable information, while Connection Failure Distance measures the structural information to control the size of subgraphs. We demonstrate the effectiveness and efficiency of MeGuide by training various GNNs on multiple datasets.
    A Unified and Constructive Framework for the Universality of Neural Networks. (arXiv:2112.14877v1 [cs.LG])
    (2 min) One of the reasons that many neural networks are capable of replicating complicated tasks or functions is their universality property. The past few decades have seen many attempts at providing constructive proofs for single networks or classes of neural networks. This paper is an effort to provide a unified and constructive framework for the universality of a large class of activations, including most existing activations and beyond. At the heart of the framework is the concept of the neural network approximate identity. It turns out that most existing activations are neural network approximate identities, and thus universal in the space of continuous functions on compacta. The framework has several advantages. First, it is constructive, using elementary means from functional analysis, probability theory, and numerical analysis. Second, it is the first unified attempt that is valid for most existing activations. Third, as a byproduct, the framework provides the first universality proofs for some existing activation functions, including Mish, SiLU, ELU, and GELU. Fourth, it discovers new activations with a guaranteed universality property: indeed, any activation whose $k$th derivative, with $k$ an integer, is integrable and essentially bounded is universal. Fifth, for a given activation and error tolerance, the framework provides precisely the architecture of the corresponding one-hidden-layer neural network with a predetermined number of neurons and the values of the weights/biases.
    Survey Descent: A Multipoint Generalization of Gradient Descent for Nonsmooth Optimization. (arXiv:2111.15645v2 [math.OC] UPDATED)
    (2 min) For strongly convex objectives that are smooth, the classical theory of gradient descent ensures linear convergence relative to the number of gradient evaluations. An analogous nonsmooth theory is challenging: even when the objective is smooth at every iterate, the corresponding local models are unstable, and traditional remedies need unpredictably many cutting planes. We instead propose a multipoint generalization of the gradient descent iteration for local optimization. While designed with general objectives in mind, we are motivated by a "max-of-smooth" model that captures subdifferential dimension at optimality. We prove linear convergence when the objective is itself max-of-smooth, and experiments suggest a more general phenomenon.
    A Deep Reinforcement Learning Approach for Solving the Traveling Salesman Problem with Drone. (arXiv:2112.12545v2 [math.OC] UPDATED)
    (0 min) Reinforcement learning has recently shown promise in learning high-quality solutions for many combinatorial optimization problems. In particular, attention-based encoder-decoder models show high effectiveness on various routing problems, including the Traveling Salesman Problem (TSP). Unfortunately, they perform poorly for the TSP with Drone (TSP-D), which requires routing a heterogeneous fleet of vehicles in coordination, a truck and a drone. In TSP-D, the two vehicles move in tandem and may need to wait at a node for the other vehicle to join. A stateless attention-based decoder fails to make such coordination between vehicles. We propose a hybrid model with an attention encoder and an LSTM decoder, in which the decoder's hidden state can represent the sequence of actions taken. We empirically demonstrate that such a hybrid model improves upon a purely attention-based model in both solution quality and computational efficiency. Our experiments on the min-max Capacitated Vehicle Routing Problem (mmCVRP) also confirm that the hybrid model is more suitable than the attention-based model for the coordinated routing of multiple vehicles.
    Random Graph-Based Neuromorphic Learning with a Layer-Weaken Structure. (arXiv:2111.08888v2 [cs.LG] UPDATED)
    (0 min) A unified understanding of neural networks (NNs) remains elusive, because it is unclear what rules should be obeyed to optimize their internal structure. Considering the potential of random graphs to alter how computation is performed, we demonstrate that they can serve as architecture generators for optimizing the internal structure of NNs. To turn random graph theory into an NN model with practical meaning, and after clarifying the input-output relationship of each neuron, we perform data feature mapping by computing Fourier Random Features (FRFs). With this low-cost approach, neurons are assigned to several groups whose connection relationships can be regarded as uniform representations of the random graphs they belong to, and a random arrangement fuses those neurons to establish the pattern matrix, markedly reducing manual participation and computational cost without requiring a fixed, deep architecture. Leveraging this single neuromorphic learning model, termed the random graph-based neural network (RGNN), we develop a joint classification mechanism involving information interaction between multiple RGNNs, and we achieve significant performance improvements in supervised learning on three benchmark tasks, effectively avoiding the adverse impact of NN interpretability concerns on structure design and engineering practice.
    Benign Overfitting in Adversarially Robust Linear Classification. (arXiv:2112.15250v1 [cs.LG])
    (0 min) "Benign overfitting", where classifiers memorize noisy training data yet still achieve good generalization performance, has drawn great attention in the machine learning community. To explain this surprising phenomenon, a series of works has provided theoretical justification in over-parameterized linear regression, classification, and kernel methods. However, it is not clear whether benign overfitting still occurs in the presence of adversarial examples, i.e., examples with tiny, intentional perturbations designed to fool the classifiers. In this paper, we show that benign overfitting indeed occurs in adversarial training, a principled approach to defend against adversarial examples. In detail, we prove risk bounds for the adversarially trained linear classifier on a mixture of sub-Gaussian data under $\ell_p$ adversarial perturbations. Our result suggests that under moderate perturbations, adversarially trained linear classifiers can achieve near-optimal standard and adversarial risks, despite overfitting the noisy training data. Numerical experiments validate our theoretical findings.
    ifMixup: Towards Intrusion-Free Graph Mixup for Graph Classification. (arXiv:2110.09344v2 [cs.LG] UPDATED)
    (0 min) We present a simple and yet effective interpolation-based regularization technique to improve the generalization of Graph Neural Networks (GNNs). Our method leverages recent advances in the Mixup regularizer for vision and text, where random sample pairs and their labels are interpolated to create synthetic samples for training. Unlike images or natural sentences, graphs have arbitrary structure and topology, and even simply deleting or adding one edge can dramatically change a graph's semantic meaning. This makes interpolating graph inputs very challenging, because mixing graph pairs may naturally create graphs with identical structure but conflicting labels, causing the manifold intrusion issue. To cope with this obstacle, we propose a simple input mixing scheme for Mixup on graphs, coined ifMixup. We theoretically prove that, under a mild assumption, ifMixup guarantees that the mixed graphs are free of manifold intrusion. We also empirically verify that ifMixup can effectively regularize graph classification learning, resulting in superior predictive accuracy over popular graph augmentation baselines.
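    For contrast with the graph setting, standard Mixup on vector inputs is a one-line interpolation of sample pairs and their labels; a minimal sketch (the graph-specific ifMixup scheme itself is not reproduced here):

        import numpy as np

        def mixup(x1, y1, x2, y2, alpha=0.2):
            # Convex combination of two samples and their (one-hot) labels.
            lam = np.random.beta(alpha, alpha)
            return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2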
    Accelerated Primal-Dual Gradient Method for Smooth and Convex-Concave Saddle-Point Problems with Bilinear Coupling. (arXiv:2112.15199v1 [math.OC])
    (0 min) In this paper we study a convex-concave saddle-point problem $\min_x\max_y f(x) + y^\top\mathbf{A} x - g(y)$, where $f(x)$ and $g(y)$ are smooth and convex functions. We propose an Accelerated Primal-Dual Gradient Method for solving this problem which (i) achieves an optimal linear convergence rate in the strongly-convex-strongly-concave regime matching the lower complexity bound (Zhang et al., 2021) and (ii) achieves an accelerated linear convergence rate in the case when only one of the functions $f(x)$ and $g(y)$ is strongly convex or even none of them are. Finally, we obtain a linearly-convergent algorithm for the general smooth and convex-concave saddle point problem $\min_x\max_y F(x,y)$ without requirement of strong convexity or strong concavity.
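    For intuition, plain simultaneous gradient descent-ascent on the bilinearly coupled objective above looks as follows; the paper's accelerated primal-dual method adds momentum and carefully coupled step sizes that this baseline sketch omits.

        import numpy as np

        def gda(grad_f, grad_g, A, x, y, eta=0.01, steps=2000):
            # Descend in x and ascend in y on f(x) + y^T A x - g(y).
            for _ in range(steps):
                gx = grad_f(x) + A.T @ y   # gradient of the objective in x
                gy = A @ x - grad_g(y)     # gradient of the objective in y
                x, y = x - eta * gx, y + eta * gy
            return x, y

        # Strongly-convex-strongly-concave toy instance: f(x) = |x|^2/2, g(y) = |y|^2/2.
        A = np.array([[1.0, 0.5], [0.0, 1.0]])
        x_star, y_star = gda(lambda x: x, lambda y: y, A, np.ones(2), np.ones(2))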
    A Neural Network Solves and Generates Mathematics Problems by Program Synthesis: Calculus, Differential Equations, Linear Algebra, and More. (arXiv:2112.15594v1 [cs.LG])
    (0 min) We demonstrate that a neural network pre-trained on text and fine-tuned on code solves Mathematics problems by program synthesis. We turn questions into programming tasks, automatically generate programs, and then execute them, perfectly solving university-level problems from MIT's large Mathematics courses (Single Variable Calculus 18.01, Multivariable Calculus 18.02, Differential Equations 18.03, Introduction to Probability and Statistics 18.05, Linear Algebra 18.06, and Mathematics for Computer Science 6.042) as well as questions from a MATH dataset (on Prealgebra, Algebra, Counting and Probability, Number Theory, and Precalculus), the latest benchmark of advanced mathematics problems specifically designed to assess mathematical reasoning. We explore prompt generation methods that enable Transformers to generate question solving programs for these subjects, including solutions with plots. We generate correct answers for a random sample of questions in each topic. We quantify the gap between the original and transformed questions and perform a survey to evaluate the quality and difficulty of generated questions. This is the first work to automatically solve, grade, and generate university-level Mathematics course questions at scale which represents a milestone for higher education.
    Unintended Selection: Persistent Qualification Rate Disparities and Interventions. (arXiv:2111.01201v2 [cs.LG] UPDATED)
    (0 min) Realistically -- and equitably -- modeling the dynamics of group-level disparities in machine learning remains an open problem. In particular, we desire models that do not suppose inherent differences between artificial groups of people -- but rather endogenize disparities by appeal to unequal initial conditions of insular subpopulations. In this paper, agents each have a real-valued feature $X$ (e.g., credit score) informed by a "true" binary label $Y$ representing qualification (e.g., for a loan). Each agent alternately (1) receives a binary classification label $\hat{Y}$ (e.g., loan approval) from a Bayes-optimal machine learning classifier observing $X$ and (2) may update their qualification $Y$ by imitating successful strategies (e.g., seek a raise) within an isolated group $G$ of agents to which they belong. We consider the disparity of qualification rates $\Pr(Y=1)$ between different groups and how this disparity changes subject to a sequence of Bayes-optimal classifiers repeatedly retrained on the global population. We model the evolving qualification rates of each subpopulation (group) using the replicator equation, which derives from a class of imitation processes. We show that differences in qualification rates between subpopulations can persist indefinitely for a set of non-trivial equilibrium states due to uninformed classifier deployments, even when groups are identical in all aspects except initial qualification densities. We next simulate the effects of commonly proposed fairness interventions on this dynamical system along with a new feedback control mechanism capable of permanently eliminating group-level qualification rate disparities. We conclude by discussing the limitations of our model and findings and by outlining potential future work.
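    The replicator equation driving each group's qualification rate has a compact two-strategy form; a minimal sketch with illustrative fitness values (not the paper's parameterization):

        def replicator_step(p, fit_qualified, fit_unqualified, dt=0.01):
            # Two-strategy replicator dynamics for the qualified fraction p:
            # dp/dt = p * (1 - p) * (f_qualified - f_unqualified).
            return p + dt * p * (1.0 - p) * (fit_qualified - fit_unqualified)

        p = 0.3
        for _ in range(2000):
            # A persistent payoff advantage for qualification pulls p toward 1;
            # equal payoffs leave any initial disparity frozen in place.
            p = replicator_step(p, fit_qualified=1.2, fit_unqualified=1.0)
        print(round(p, 3))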
    On the Effectiveness of Generative Adversarial Network on Anomaly Detection. (arXiv:2112.15541v1 [cs.LG])
    (0 min) Anomaly identification refers to detecting samples that do not resemble the training data distribution. Many generative models have been used to find anomalies, and among them, generative adversarial network (GAN)-based approaches are currently very popular. These approaches mainly rely on the rich contextual information of GANs to identify the actual training distribution. Following this analogy, we suggest a new unsupervised model based on GANs: a combination of an autoencoder and a GAN. Further, a new scoring function is introduced to target anomalies, in which a linear combination of the internal representation of the discriminator and the generator's visual representation, plus the encoded representation of the autoencoder, together define the proposed anomaly score. The model is evaluated on benchmark datasets such as SVHN, CIFAR10, and MNIST, as well as a public medical dataset of leukemia images. In all the experiments, our model outperforms its existing counterparts while slightly improving inference time.
    Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation. (arXiv:2110.06394v2 [cs.LG] UPDATED)
    (0 min) We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs). In this setting, the agent works in two phases. In the exploration phase, the agent interacts with the environment and collects samples without the reward. In the planning phase, the agent is given a specific reward function and uses samples collected in the exploration phase to learn a good policy. We propose a new provably efficient algorithm, called UCRL-RFE, under the Linear Mixture MDP assumption, where the transition probability kernel of the MDP can be parameterized by a linear function over certain feature mappings defined on the triplet of state, action, and next state. We show that to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{\mathcal{O}}(H^5d^2\epsilon^{-2})$ episodes during the exploration phase. Here, $H$ is the length of the episode and $d$ is the dimension of the feature mapping. We also propose a variant of UCRL-RFE using a Bernstein-type bonus and show that it needs to sample at most $\tilde{\mathcal{O}}(H^4d(H + d)\epsilon^{-2})$ episodes to achieve an $\epsilon$-optimal policy. By constructing a special class of linear mixture MDPs, we also prove that any reward-free algorithm needs to sample at least $\tilde \Omega(H^2d\epsilon^{-2})$ episodes to obtain an $\epsilon$-optimal policy. Our upper bound matches the lower bound in terms of the dependence on $\epsilon$, and on $d$ if $H \ge d$.
    Evaluation of Interpretability for Deep Learning algorithms in EEG Emotion Recognition: A case study in Autism. (arXiv:2111.13208v2 [eess.SP] UPDATED)
    (0 min) Current models in Explainable Artificial Intelligence (XAI) have shown an evident, quantified lack of reliability in measuring feature relevance when statistically entangled features are proposed for training deep classifiers. There has been an increase in the application of Deep Learning in clinical trials to predict the early diagnosis of neuro-developmental disorders, such as Autism Spectrum Disorder (ASD). However, the inclusion of more reliable saliency maps, needed to obtain more trustworthy and interpretable metrics from neural activity features, is still insufficiently mature for practical applications in diagnostics or clinical trials. Moreover, in ASD research the inclusion of deep classifiers that use neural measures to predict viewed facial emotions is relatively unexplored. Therefore, in this study we propose the evaluation of a Convolutional Neural Network (CNN) for electroencephalography (EEG)-based facial emotion recognition, complemented with a novel RemOve-And-Retrain (ROAR) methodology to recover the highly relevant features used by the classifier. Specifically, we compare well-known relevance maps such as Layer-Wise Relevance Propagation (LRP), PatternNet, Pattern-Attribution, and Smooth-Grad Squared. This study is the first to consolidate a more transparent feature-relevance calculation for successful EEG-based facial emotion recognition using a within-subject-trained CNN in typically-developed and ASD individuals.
    Expected hypervolume improvement for simultaneous multi-objective and multi-fidelity optimization. (arXiv:2112.13901v2 [cs.LG] UPDATED)
    (0 min) Bayesian optimization has proven to be an efficient method for optimizing expensive-to-evaluate systems. However, depending on the cost of single observations, multi-dimensional optimization of one or more objectives may still be prohibitively expensive. Multi-fidelity optimization remedies this issue by including multiple, cheaper information sources, such as low-resolution approximations in numerical simulations. Acquisition functions for multi-fidelity optimization are typically based on exploration-heavy algorithms that are difficult to combine with optimization towards multiple objectives. Here we show that the expected hypervolume improvement policy can, in many situations, act as a suitable substitute. We incorporate the evaluation cost either via a two-step evaluation or within a single acquisition function with an additional fidelity-related objective. This permits simultaneous multi-objective and multi-fidelity optimization, which allows one to accurately establish the Pareto set and front at fractional cost. Benchmarks show a cost reduction of an order of magnitude or more. Our method thus allows Pareto optimization of extremely expensive black-box functions. The presented methods are simple and straightforward to implement in existing, optimized Bayesian optimization frameworks and can immediately be extended to batch optimization. The techniques can also be used to combine different continuous and/or discrete fidelity dimensions, which makes them particularly relevant for simulation problems in plasma physics, fluid dynamics, and many other branches of scientific computing.
    SGTR: End-to-end Scene Graph Generation with Transformer. (arXiv:2112.12970v2 [cs.CV] UPDATED)
    (0 min) Scene Graph Generation (SGG) remains a challenging visual understanding task due to its complex compositional property. Most previous works adopt a bottom-up two-stage or a point-based one-stage approach, which often suffers from high time complexity or sub-optimal design assumptions. In this work, we propose a novel SGG method that addresses these issues by formulating the task as a bipartite graph construction problem. To solve the problem, we develop a transformer-based end-to-end framework that first generates the entity and predicate proposal sets, followed by inferring directed edges to form relation triplets. In particular, we develop a new entity-aware predicate representation based on a structural predicate generator to leverage the compositional property of relationships. Moreover, we design a graph assembling module to infer the connectivity of the bipartite scene graph based on our entity-aware structure, enabling us to generate the scene graph in an end-to-end manner. Extensive experimental results show that our design achieves state-of-the-art or comparable performance on two challenging benchmarks, surpassing most of the existing approaches and enjoying higher inference efficiency. We hope our model can serve as a strong baseline for Transformer-based scene graph generation.
    Neural Thompson Sampling. (arXiv:2010.00827v2 [cs.LG] UPDATED)
    (0 min) Thompson Sampling (TS) is one of the most effective algorithms for solving contextual multi-armed bandit problems. In this paper, we propose a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation. At the core of our algorithm is a novel posterior distribution of the reward, where its mean is the neural network approximator, and its variance is built upon the neural tangent features of the corresponding neural network. We prove that, provided the underlying reward function is bounded, the proposed algorithm is guaranteed to achieve a cumulative regret of $\mathcal{O}(T^{1/2})$, which matches the regret of other contextual bandit algorithms in terms of total round number $T$. Experimental comparisons with other benchmark bandit algorithms on various data sets corroborate our theory.
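    As background, classic Thompson Sampling with an independent Gaussian posterior per arm fits in a few lines; Neural Thompson Sampling replaces the closed-form mean and variance below with a network approximator and neural tangent features (this sketch is illustrative, under a unit-noise assumption):

        import numpy as np

        def thompson_round(means, counts, reward_fn):
            # Sample one plausible value per arm from its posterior, play the argmax.
            post_var = 1.0 / np.maximum(counts, 1)        # shrinks as an arm is pulled
            arm = int(np.argmax(np.random.normal(means, np.sqrt(post_var))))
            r = reward_fn(arm)
            # Update the chosen arm's running-mean posterior.
            counts[arm] += 1
            means[arm] += (r - means[arm]) / counts[arm]
            return arm, r

        means, counts = np.zeros(3), np.zeros(3)
        for _ in range(500):
            thompson_round(means, counts, lambda a: np.random.normal([0.1, 0.5, 0.9][a]))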
    Infinite wide (finite depth) Neural Networks benefit from multi-task learning unlike shallow Gaussian Processes -- an exact quantitative macroscopic characterization. (arXiv:2112.15577v1 [cs.LG])
    (0 min) We prove in this paper that wide ReLU neural networks (NNs) with at least one hidden layer, optimized with $\ell_2$-regularization on the parameters, enforce multi-task learning due to representation learning, also in the infinite-width limit. This contrasts with multiple other idealized settings discussed in the literature, where wide (ReLU) NNs lose their ability to benefit from multi-task learning in the infinite-width limit. We deduce the multi-task learning ability by proving an exact quantitative macroscopic characterization of the learned NN in function space.
    An Efficient Epileptic Seizure Detection Technique using Discrete Wavelet Transform and Machine Learning Classifiers. (arXiv:2109.13811v2 [eess.SP] UPDATED)
    (0 min) This paper presents an epilepsy detection method based on the discrete wavelet transform (DWT) and machine learning classifiers. DWT is used for feature extraction, as it provides a good decomposition of the signals in different frequency bands. First, the DWT is applied to the EEG signal to extract the detail and approximation coefficients of the different sub-bands. After the extraction of the coefficients, principal component analysis (PCA) is applied to the different sub-bands, and a feature-level fusion technique is then used to extract the important features in a low-dimensional feature space. Three classifiers, namely the Support Vector Machine (SVM), K-Nearest-Neighbor (KNN), and Naive Bayes (NB) classifiers, are used in the proposed work to classify the EEG signals. The proposed method is tested on the Bonn database and provides a maximum of 100% recognition accuracy for the KNN, SVM, and NB classifiers.
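    A minimal sketch of the DWT feature-extraction stage using the PyWavelets library (sub-band decomposition only; the PCA, feature-level fusion, and classification stages are omitted, and the wavelet choice here is illustrative):

        import numpy as np
        import pywt

        def dwt_subbands(eeg_channel, wavelet="db4", level=5):
            # Returns [cA_level, cD_level, ..., cD_1]: one approximation
            # sub-band plus one detail sub-band per decomposition level.
            return pywt.wavedec(eeg_channel, wavelet, level=level)

        coeffs = dwt_subbands(np.random.randn(4096))
        print([len(c) for c in coeffs])  # coefficient count per sub-band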
    BiTr-Unet: a CNN-Transformer Combined Network for MRI Brain Tumor Segmentation. (arXiv:2109.12271v2 [eess.IV] UPDATED)
    (0 min) Convolutional neural networks (CNNs) have achieved remarkable success in automatically segmenting organs or lesions on 3D medical images. Recently, vision transformer networks have exhibited exceptional performance in 2D image classification tasks. Compared with CNNs, transformer networks have an appealing advantage of extracting long-range features due to their self-attention algorithm. Therefore, we propose a CNN-Transformer combined model, called BiTr-Unet, with specific modifications for brain tumor segmentation on multi-modal MRI scans. Our BiTr-Unet achieves good performance on the BraTS2021 validation dataset with median Dice score 0.9335, 0.9304 and 0.8899, and median Hausdorff distance 2.8284, 2.2361 and 1.4142 for the whole tumor, tumor core, and enhancing tumor, respectively. On the BraTS2021 testing dataset, the corresponding results are 0.9257, 0.9350 and 0.8874 for Dice score, and 3, 2.2361 and 1.4142 for Hausdorff distance. The code is publicly available at https://github.com/JustaTinyDot/BiTr-Unet.
    Single-Shot Pruning for Offline Reinforcement Learning. (arXiv:2112.15579v1 [cs.LG])
    (0 min) Deep Reinforcement Learning (RL) is a powerful framework for solving complex real-world problems. The large neural networks employed in the framework are traditionally associated with better generalization capabilities, but their increased size entails the drawbacks of extensive training duration, substantial hardware resources, and longer inference times. One way to tackle this problem is to prune neural networks, leaving only the necessary parameters. State-of-the-art concurrent pruning techniques for imposing sparsity perform demonstrably well in applications where data distributions are fixed. However, they have not yet been substantially explored in the context of RL. We close the gap between RL and single-shot pruning techniques and present a general pruning approach for offline RL. We leverage a fixed dataset to prune neural networks before the start of RL training. We then run experiments varying the network sparsity level and evaluating the validity of pruning-at-initialization techniques in continuous control tasks. Our results show that with 95% of the network weights pruned, offline-RL algorithms can still retain performance in the majority of our experiments. To the best of our knowledge, no prior work utilizing pruning in RL retained performance at such high levels of sparsity. Moreover, pruning-at-initialization techniques can be easily integrated into any existing offline-RL algorithm without changing the learning objective.
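    One common single-shot baseline, global magnitude pruning at initialization, fits in a few lines; the paper studies pruning-at-initialization criteria more broadly, so treat this numpy sketch as illustrative:

        import numpy as np

        def magnitude_prune(weights, sparsity=0.95):
            # Zero out the smallest-magnitude weights across all layers,
            # keeping only the top (1 - sparsity) fraction.
            flat = np.concatenate([w.ravel() for w in weights])
            threshold = np.quantile(np.abs(flat), sparsity)
            masks = [np.abs(w) > threshold for w in weights]
            return [w * m for w, m in zip(weights, masks)], masks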
    Hamiltonian Dynamics with Non-Newtonian Momentum for Rapid Sampling. (arXiv:2111.02434v3 [cs.LG] UPDATED)
    (0 min) Sampling from an unnormalized probability distribution is a fundamental problem in machine learning with applications including Bayesian modeling, latent factor inference, and energy-based model training. After decades of research, variations of MCMC remain the default approach to sampling despite slow convergence. Auxiliary neural models can learn to speed up MCMC, but the overhead for training the extra model can be prohibitive. We propose a fundamentally different approach to this problem via a new Hamiltonian dynamics with a non-Newtonian momentum. In contrast to MCMC approaches like Hamiltonian Monte Carlo, no stochastic step is required. Instead, the proposed deterministic dynamics in an extended state space exactly sample the target distribution, specified by an energy function, under an assumption of ergodicity. Alternatively, the dynamics can be interpreted as a normalizing flow that samples a specified energy model without training. The proposed Energy Sampling Hamiltonian (ESH) dynamics have a simple form that can be solved with existing ODE solvers, but we derive a specialized solver that exhibits much better performance. ESH dynamics converge faster than their MCMC competitors enabling faster, more stable training of neural network energy models.
    Representation Learning via Consistent Assignment of Views to Clusters. (arXiv:2112.15421v1 [cs.LG])
    (0 min) We introduce Consistent Assignment for Representation Learning (CARL), an unsupervised learning method that learns visual representations by combining ideas from self-supervised contrastive learning and deep clustering. By viewing contrastive learning from a clustering perspective, CARL learns unsupervised representations by learning a set of general prototypes that serve as energy anchors to enforce that different views of a given image are assigned to the same prototype. Unlike contemporary work on contrastive learning with deep clustering, CARL learns the set of general prototypes in an online fashion, using gradient descent, without the need for non-differentiable algorithms or K-Means to solve the cluster assignment problem. CARL surpasses its competitors in many representation learning benchmarks, including linear evaluation, semi-supervised learning, and transfer learning.
    From Twitter to Traffic Predictor: Next-Day Morning Traffic Prediction Using Social Media Data. (arXiv:2009.13794v3 [cs.SI] UPDATED)
    (0 min) The effectiveness of traditional traffic prediction methods is often extremely limited when forecasting traffic dynamics in the early morning. The reason is that traffic can break down drastically during the early morning commute, and the time and duration of this breakdown vary substantially from day to day. Early morning traffic forecasts are crucial for informing morning-commute traffic management, but they are generally challenging to make in advance, particularly by midnight. In this paper, we propose to mine Twitter messages as a probe to understand the impact of people's work and rest patterns in the evening and midnight of the previous day on next-day morning traffic. The model is tested on freeway networks in Pittsburgh. The resulting relationship is surprisingly simple and powerful. We find that, in general, the earlier people rest, as indicated by their tweets, the more congested roads will be the next morning. The occurrence of big events the evening before, represented by tweet sentiment that is higher or lower than normal, often implies lower travel demand the next morning than on normal days. Besides, people's tweeting activities in the night before and in the early morning are statistically associated with congestion in the morning peak hours. We make use of such relationships to build a predictive framework that forecasts morning commute congestion from people's tweeting profiles extracted by 5 am, or as late as the midnight prior to the morning. The Pittsburgh study supports that our framework can precisely predict morning congestion, particularly for road segments upstream of roadway bottlenecks with large day-to-day congestion variation. Our approach considerably outperforms existing methods without Twitter message features, and it learns meaningful representations of demand from tweeting profiles that offer managerial insights.
    PCACE: A Statistical Approach to Ranking Neurons for CNN Interpretability. (arXiv:2112.15571v1 [cs.CV])
    (0 min) In this paper we introduce a new problem within the growing literature on interpretability for convolutional neural networks (CNNs). While previous work has focused on the question of how to visually interpret CNNs, we ask what it is that we care to interpret, that is, which layers and neurons are worth our attention. Due to the vast size of modern deep learning network architectures, automated, quantitative methods are needed to rank the relative importance of neurons and thereby answer this question. We present a new statistical method for ranking the hidden neurons in any convolutional layer of a network. We define importance as the maximal correlation between the activation maps and the class score. We provide different ways in which this method can be used for visualization purposes with MNIST and ImageNet, and show a real-world application of our method to air pollution prediction with street-level images.
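    A loose numpy sketch of the ranking idea, scoring each hidden neuron by the largest absolute correlation between its activation-map entries and the class score across images; this is a simplified reading of the method, with all shapes and names illustrative:

        import numpy as np

        def rank_neurons(activations, class_scores):
            # activations: (n_images, n_neurons, h, w); class_scores: (n_images,).
            # Assumes activations are non-constant across images (else corrcoef is undefined).
            n, k = activations.shape[:2]
            flat = activations.reshape(n, k, -1)
            scores = []
            for j in range(k):
                corrs = [abs(np.corrcoef(flat[:, j, p], class_scores)[0, 1])
                         for p in range(flat.shape[2])]
                scores.append(max(corrs))    # maximal correlation for neuron j
            return np.argsort(scores)[::-1]  # most important neurons first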
    Machine learning based disease diagnosis: A comprehensive review. (arXiv:2112.15538v1 [cs.LG])
    (0 min) Globally, there is a substantial unmet need to diagnose various diseases effectively. The complexity of different disease mechanisms and the underlying symptoms of the patient population present massive challenges for developing early diagnosis tools and effective treatments. Machine Learning (ML), an area of Artificial Intelligence (AI), enables researchers, physicians, and patients to solve some of these issues. Based on relevant research, this review explains how Machine Learning (ML) and Deep Learning (DL) are being used to aid the early identification of numerous diseases. First, a bibliometric study of the literature is presented using data from the Scopus and Web of Science (WOS) databases. The bibliometric study of 1216 publications was undertaken to determine the most prolific authors, nations, organizations, and most cited articles. The review then summarizes the most recent trends and approaches in Machine-Learning-based Disease Diagnosis (MLBDD), considering the following factors: algorithm, disease type, data type, application, and evaluation metrics. Finally, the paper highlights key results and provides insight into future trends and opportunities in the MLBDD area.
    BERTphone: Phonetically-Aware Encoder Representations for Utterance-Level Speaker and Language Recognition. (arXiv:1907.00457v2 [cs.CL] UPDATED)
    (0 min) We introduce BERTphone, a Transformer encoder trained on large speech corpora that outputs phonetically-aware contextual representation vectors that can be used for both speaker and language recognition. This is accomplished by training on two objectives: the first, inspired by adapting BERT to the continuous domain, involves masking spans of input frames and reconstructing the whole sequence for acoustic representation learning; the second, inspired by the success of bottleneck features from ASR, is a sequence-level CTC loss applied to phoneme labels for phonetic representation learning. We pretrain two BERTphone models (one on Fisher and one on TED-LIUM) and use them as feature extractors into x-vector-style DNNs for both tasks. We attain a state-of-the-art $C_{\text{avg}}$ of 6.16 on the challenging LRE07 3sec closed-set language recognition task. On Fisher and VoxCeleb speaker recognition tasks, we see an 18% relative reduction in speaker EER when training on BERTphone vectors instead of MFCCs. In general, BERTphone outperforms previous phonetic pretraining approaches on the same data. We release our code and models at https://github.com/awslabs/speech-representations.
    Data-driven advice for interpreting local and global model predictions in bioinformatics problems. (arXiv:2108.06201v2 [stat.ML] UPDATED)
    (0 min) Tree-based algorithms such as random forests and gradient boosted trees continue to be among the most popular and powerful machine learning models used across multiple disciplines. The conventional wisdom for estimating the impact of a feature in tree-based models is to measure the node-wise reduction of a loss function, which (i) yields only global importance measures and (ii) is known to suffer from severe biases. Conditional feature contributions (CFCs) provide local, case-by-case explanations of a prediction by following the decision path and attributing changes in the expected output of the model to each feature along the path. However, Lundberg et al. pointed out a potential bias of CFCs which depends on the distance from the root of a tree. The by-now immensely popular alternative, SHapley Additive exPlanation (SHAP) values, appears to mitigate this bias but is computationally much more expensive. Here we contribute a thorough comparison of the explanations computed by both methods on a set of 164 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers. For random forests, we find extremely high similarities and correlations of both local and global SHAP values and CFC scores, leading to very similar rankings and interpretations. Analogous conclusions hold for the fidelity of using global feature importance scores as a proxy for the predictive power associated with each feature.
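    For readers reproducing such comparisons, global impurity importances and local tree-SHAP attributions can be obtained as below (a hedged sketch; the dataset and model are placeholders, and CFC scores would come from a package such as treeinterpreter, not shown):

        import shap
        from sklearn.datasets import load_breast_cancer
        from sklearn.ensemble import RandomForestClassifier

        X, y = load_breast_cancer(return_X_y=True)
        model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

        global_imp = model.feature_importances_                 # node-wise loss reduction (global only)
        shap_values = shap.TreeExplainer(model).shap_values(X)  # local, per-prediction attributions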
    Accelerated Proximal Alternating Gradient-Descent-Ascent for Nonconvex Minimax Machine Learning. (arXiv:2112.11663v3 [cs.LG] UPDATED)
    (0 min) Alternating gradient-descent-ascent (AltGDA) is an optimization algorithm that has been widely used for model training in various machine learning applications, which aim to solve a nonconvex minimax optimization problem. However, the existing studies show that it suffers from a high computation complexity in nonconvex minimax optimization. In this paper, we develop a single-loop and fast AltGDA-type algorithm that leverages proximal gradient updates and momentum acceleration to solve regularized nonconvex minimax optimization problems. By identifying the intrinsic Lyapunov function of this algorithm, we prove that it converges to a critical point of the nonconvex minimax optimization problem and achieves a computation complexity $\mathcal{O}(\kappa^{1.5}\epsilon^{-2})$, where $\epsilon$ is the desired level of accuracy and $\kappa$ is the problem's condition number. Such a computation complexity improves the state-of-the-art complexities of single-loop GDA and AltGDA algorithms (see the summary of comparison in Table 1). We demonstrate the effectiveness of our algorithm via an experiment on adversarial deep learning.
    Learning Agent State Online with Recurrent Generate-and-Test. (arXiv:2112.15236v1 [cs.LG])
    (0 min) Learning continually and online from a continuous stream of data is challenging, especially for a reinforcement learning agent with sequential data. When the environment only provides observations giving partial information about the state of the environment, the agent must construct a state from the data stream of experience; we refer to the state learned directly from experience as the agent state. Recurrent neural networks can learn the agent state, but the training methods are computationally expensive and sensitive to the hyper-parameters, making them ill-suited for online learning. This work introduces methods based on the generate-and-test approach to learn the agent state. A generate-and-test algorithm searches for state features by generating features and testing their usefulness. In this process, features useful for the agent's performance on the task are preserved, and the least useful features get replaced with newly generated features. We study the effectiveness of our methods on two online multi-step prediction problems. The first problem, trace conditioning, focuses on the agent's ability to remember a cue for a prediction multiple steps into the future. In the second problem, trace patterning, the agent needs to learn patterns in the observation signals and remember them for future predictions. We show that our proposed methods can effectively learn the agent state online and produce accurate predictions.
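    A minimal sketch of the generate-and-test idea itself, under illustrative assumptions (random threshold features, utility measured by each feature's weight magnitude in an online linear predictor); the paper's recurrent variant is more elaborate.

        import numpy as np

        rng = np.random.default_rng(0)
        n_in, n_feat, alpha = 8, 32, 0.01
        W = rng.standard_normal((n_feat, n_in))   # random feature generators
        b = rng.standard_normal(n_feat)
        w = np.zeros(n_feat)                      # linear prediction weights

        for step in range(10000):
            x = rng.standard_normal(n_in)
            target = np.sin(x[0])                 # toy prediction target
            phi = (W @ x + b > 0).astype(float)   # binary threshold features
            err = target - w @ phi
            w += alpha * err * phi                # online LMS update (test phase)
            if step % 500 == 499:                 # replace the least useful features
                worst = np.argsort(np.abs(w))[: n_feat // 10]
                W[worst] = rng.standard_normal((len(worst), n_in))
                b[worst] = rng.standard_normal(len(worst))
                w[worst] = 0.0                    # regenerate and reset (generate phase)
        print("final absolute error:", abs(err))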
    Uniform-in-Phase-Space Data Selection with Iterative Normalizing Flows. (arXiv:2112.15446v1 [cs.LG])
    (0 min) Improvements in computational and experimental capabilities are rapidly increasing the amount of scientific data that is routinely generated. In applications that are constrained by memory and computational intensity, excessively large datasets may hinder scientific discovery, making data reduction a critical component of data-driven methods. Datasets are growing in two directions: the number of data points and their dimensionality. Whereas data compression techniques are concerned with reducing dimensionality, the focus here is on reducing the number of data points. A strategy is proposed to select data points such that they uniformly span the phase-space of the data. The algorithm proposed relies on estimating the probability map of the data and using it to construct an acceptance probability. An iterative method is used to accurately estimate the probability of the rare data points when only a small subset of the dataset is used to construct the probability map. Instead of binning the phase-space to estimate the probability map, its functional form is approximated with a normalizing flow. Therefore, the method naturally extends to high-dimensional datasets. The proposed framework is demonstrated as a viable pathway to enable data-efficient machine learning when abundant data is available. An implementation of the method is available in a companion repository (https://github.com/NREL/Phase-space-sampling).
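    A sketch of the acceptance-probability mechanism, with a Gaussian kernel density estimate standing in for the paper's iteratively trained normalizing flow; the rare-density reference level is an illustrative choice.

        import numpy as np
        from scipy.stats import gaussian_kde

        rng = np.random.default_rng(0)
        data = rng.standard_normal((20000, 2))   # over-represented near the mode

        kde = gaussian_kde(data.T)               # estimated probability map
        p = kde(data.T)                          # density at each data point
        ref = np.percentile(p, 5)                # density level treated as "rare"
        accept = np.minimum(1.0, ref / p)        # keep rare points, thin dense ones
        subset = data[rng.random(len(data)) < accept]
        print(len(subset))                       # roughly uniform over phase space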
    Triangular Flows for Generative Modeling: Statistical Consistency, Smoothness Classes, and Fast Rates. (arXiv:2112.15595v1 [stat.ML])
    (0 min) Triangular flows, also known as Kn\"{o}the-Rosenblatt measure couplings, comprise an important building block of normalizing flow models for generative modeling and density estimation, including popular autoregressive flow models such as real-valued non-volume preserving transformation models (Real NVP). We present statistical guarantees and sample complexity bounds for triangular flow statistical models. In particular, we establish the statistical consistency and the finite sample convergence rates of the Kullback-Leibler estimator of the Kn\"{o}the-Rosenblatt measure coupling using tools from empirical process theory. Our results highlight the anisotropic geometry of function classes at play in triangular flows, shed light on optimal coordinate ordering, and lead to statistical guarantees for Jacobian flows. We conduct numerical experiments on synthetic data to illustrate the practical implications of our theoretical findings.
    Improving Baselines in the Wild. (arXiv:2112.15550v1 [cs.LG])
    (0 min) We share our experience with the recently released WILDS benchmark, a collection of ten datasets dedicated to developing models and training strategies which are robust to domain shifts. Several experiments yield a couple of critical observations which we believe are of general interest for any future work on WILDS. Our study focuses on two datasets: iWildCam and FMoW. We show that (1) Conducting separate cross-validation for each evaluation metric is crucial for both datasets, (2) A weak correlation between validation and test performance might make model development difficult for iWildCam, (3) Minor changes to the training hyper-parameters improve the baseline by a relatively large margin (mainly on FMoW), (4) There is a strong correlation between certain domains and certain target labels (mainly on iWildCam). To the best of our knowledge, no prior work on these datasets has reported these observations despite their obvious importance. Our code is public.
    Learning the Regularization in DCE-MR Image Reconstruction for Functional Imaging of Kidneys. (arXiv:2109.07548v2 [cs.LG] UPDATED)
    (0 min) Kidney DCE-MRI aims at both qualitative assessment of kidney anatomy and quantitative assessment of kidney function by estimating the tracer kinetic (TK) model parameters. Accurate estimation of TK model parameters requires an accurate measurement of the arterial input function (AIF) with high temporal resolution. Accelerated imaging is used to achieve high temporal resolution, which yields under-sampling artifacts in the reconstructed images. Compressed sensing (CS) methods offer a variety of reconstruction options. Most commonly, sparsity of temporal differences is encouraged for regularization to reduce artifacts. Increasing regularization in CS methods removes the ambient artifacts but also over-smooths the signal temporally, which reduces the parameter estimation accuracy. In this work, we propose a deep neural network trained on a single image to reduce MRI under-sampling artifacts without reducing the accuracy of functional imaging markers. Instead of regularizing with a penalty term in optimization, we promote regularization by generating images from a lower dimensional representation. In this manuscript we motivate and explain the lower dimensional input design. We compare our approach to CS reconstructions with multiple regularization weights. The proposed approach yields kidney biomarkers that are highly correlated with the ground-truth markers estimated using the CS reconstruction that was optimized for functional analysis. At the same time, the proposed approach reduces the artifacts in the reconstructed images.
    A sampling-based approach for efficient clustering in large datasets. (arXiv:2112.14793v1 [cs.LG])
    (0 min) We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters. Our algorithm achieves high performance by evaluating the distances of data points to only a subset of the cluster centres. Our method is substantially more efficient than k-means as it does not require an all-to-all comparison of data points and clusters. We show that the optimal solutions of our approximation are the same as in the exact solution. However, our approach is considerably more efficient at extracting these clusters compared to the state-of-the-art. We compare our approximation with the exact k-means and alternative approximation approaches on a series of standardised clustering tasks. For the evaluation, we consider the algorithmic complexity, including the number of operations to convergence, and the stability of the results.
    NCIS: Neural Contextual Iterative Smoothing for Purifying Adversarial Perturbations. (arXiv:2106.11644v2 [cs.CV] UPDATED)
    (0 min) We propose a novel and effective purification-based adversarial defense method against pre-processor-blind white- and black-box attacks. Our method is computationally efficient and trained only with self-supervised learning on general images, without requiring any adversarial training or retraining of the classification model. We first show, through an empirical analysis, that the adversarial noise, defined as the residual between an original image and its adversarial example, has an almost zero-mean, symmetric distribution. Based on this observation, we propose a very simple iterative Gaussian Smoothing (GS) which can effectively smooth out adversarial noise and achieve substantially high robust accuracy. To further improve it, we propose Neural Contextual Iterative Smoothing (NCIS), which trains a blind-spot network (BSN) in a self-supervised manner to reconstruct the discriminative features of the original image that are also smoothed out by GS. From our extensive experiments on the large-scale ImageNet using four classification models, we show that our method achieves both competitive standard accuracy and state-of-the-art robust accuracy against most strong purifier-blind white- and black-box attacks. Also, we propose a new benchmark for evaluating a purification method based on commercial image classification APIs, such as AWS, Azure, Clarifai and Google. We generate adversarial examples by an ensemble transfer-based black-box attack, which can induce complete misclassification of APIs, and demonstrate that our method can be used to increase the adversarial robustness of APIs.
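    The GS baseline is simple enough to sketch directly: repeatedly blur the input so that near-zero-mean adversarial noise averages out (sigma and the iteration count are illustrative; NCIS adds the learned blind-spot network on top).

        import numpy as np
        from scipy.ndimage import gaussian_filter

        def iterative_gs(img, sigma=0.5, n_iter=10):
            out = img.astype(np.float64)
            for _ in range(n_iter):
                # smooth spatial dims only; leave the channel axis untouched
                out = gaussian_filter(out, sigma=(sigma, sigma, 0))
            return out

        x_adv = np.random.rand(224, 224, 3)   # stand-in for an adversarial image
        x_purified = iterative_gs(x_adv)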
    An objective function for order preserving hierarchical clustering. (arXiv:2109.04266v2 [cs.LG] UPDATED)
    (0 min) We present an objective function for similarity based hierarchical clustering of partially ordered data that preserves the partial order. That is, if $x \le y$, and if $[x]$ and $[y]$ are the respective clusters of $x$ and $y$, then there is an order relation $\le'$ on the clusters for which $[x] \le' [y]$. The theory distinguishes itself from existing theories for clustering of ordered data in that the order relation and the similarity are combined into a bi-objective optimisation problem to obtain a hierarchical clustering seeking to satisfy both. In particular, the order relation is weighted in the range $[0,1]$, and if the similarity and the order relation are not aligned, then order preservation may have to yield in favor of clustering. Finding an optimal solution is NP-hard, so we provide a polynomial time approximation algorithm, with a relative performance guarantee of $O\!\left(\log^{3/2} \!\!\, n \right)$, based on successive applications of directed sparsest cut. We provide a demonstration on a benchmark dataset, showing that our method outperforms existing methods for order preserving hierarchical clustering by a significant margin. The theory is an extension of the Dasgupta cost function for divisive hierarchical clustering.
    How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?. (arXiv:1911.12360v4 [cs.LG] UPDATED)
    (0 min) A recent line of research on deep learning focuses on the extremely over-parameterized setting, and shows that when the network width is larger than a high degree polynomial of the training sample size $n$ and the inverse of the target error $\epsilon^{-1}$, deep neural networks learned by (stochastic) gradient descent enjoy nice optimization and generalization guarantees. Very recently, it was shown that under certain margin assumptions on the training data, a polylogarithmic width condition suffices for two-layer ReLU networks to converge and generalize (Ji and Telgarsky, 2019). However, whether deep neural networks can be learned with such a mild over-parameterization is still an open question. In this work, we answer this question affirmatively and establish sharper learning guarantees for deep ReLU networks trained by (stochastic) gradient descent. Specifically, under certain assumptions made in previous work, our optimization and generalization guarantees hold with network width polylogarithmic in $n$ and $\epsilon^{-1}$. Our results push the study of over-parameterized deep neural networks towards more practical settings.
    Training and Generating Neural Networks in Compressed Weight Space. (arXiv:2112.15545v1 [cs.LG])
    (0 min) The inputs and/or outputs of some neural nets are weight matrices of other neural nets. Indirect encodings or end-to-end compression of weight matrices could help to scale such approaches. Our goal is to open a discussion on this topic, starting with recurrent neural networks for character-level language modelling whose weight matrices are encoded by the discrete cosine transform. Our fast weight version thereof uses a recurrent neural network to parameterise the compressed weights. We present experimental results on the enwik8 dataset.
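    A sketch of the underlying encoding step: take the 2D DCT of a weight matrix, keep a low-frequency block of coefficients, and decode back. The keep-ratio is illustrative, and the paper goes further by generating the coefficients with a recurrent net.

        import numpy as np
        from scipy.fft import dctn, idctn

        W = np.random.randn(256, 256)             # a weight matrix to compress
        C = dctn(W, norm="ortho")                 # 2D DCT of the weights

        k = 64                                    # keep a k x k low-frequency block
        C_comp = np.zeros_like(C)
        C_comp[:k, :k] = C[:k, :k]                # ~6% of the original coefficients

        W_rec = idctn(C_comp, norm="ortho")       # decode back to weight space
        print(np.linalg.norm(W - W_rec) / np.linalg.norm(W))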
    Actor Loss of Soft Actor Critic Explained. (arXiv:2112.15568v1 [cs.LG])
    (0 min) This technical report is devoted to explaining how the actor loss of soft actor critic is obtained, as well as the associated gradient estimate. It gives the necessary mathematical background to derive all the presented equations, from the theoretical actor loss to the one implemented in practice. This necessitates a comparison of the reparameterization trick used in soft actor critic with the nabla log trick, which leads to open questions regarding the most efficient method to use.
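    For reference, a hedged PyTorch sketch of the reparameterized actor loss the report derives: actions are a = tanh(mu + sigma * eps) with eps ~ N(0, I), and the loss is E[alpha * log pi(a|s) - Q(s, a)]. The tanh log-density correction follows common SAC implementations; mu, log_std, and the stand-in critic q_fn are illustrative assumptions.

        import math
        import torch

        def actor_loss(mu, log_std, q_fn, states, alpha=0.2):
            std = log_std.exp()
            eps = torch.randn_like(mu)
            u = mu + std * eps                    # reparameterization trick
            a = torch.tanh(u)                     # squashed action
            # log pi(a|s): Gaussian log-density plus the tanh change-of-variables term
            log_pi = (-0.5 * eps.pow(2) - log_std
                      - 0.5 * math.log(2 * math.pi)).sum(-1)
            log_pi -= torch.log(1 - a.pow(2) + 1e-6).sum(-1)
            q = q_fn(states, a)                   # differentiable through a
            return (alpha * log_pi - q).mean()

        mu, log_std = torch.zeros(32, 4), torch.zeros(32, 4)
        q_fn = lambda s, a: -(a ** 2).sum(-1)     # toy stand-in critic
        print(actor_loss(mu, log_std, q_fn, states=torch.zeros(32, 3)))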
    Hierarchical forecasting with a top-down alignment of independent level forecasts. (arXiv:2103.08250v4 [stat.ML] UPDATED)
    (0 min) Hierarchical forecasting with intermittent time series is a challenge in both research and empirical studies. Extensive research focuses on improving the accuracy of each hierarchy, especially the intermittent time series at bottom levels. Then hierarchical reconciliation could be used to improve the overall performance further. In this paper, we present a \emph{hierarchical-forecasting-with-alignment} approach that treats the bottom level forecasts as mutable to ensure higher forecasting accuracy on the upper levels of the hierarchy. We employ a pure deep learning forecasting approach, N-BEATS, for continuous time series at the top levels and a widely used tree-based algorithm, LightGBM, for the intermittent time series at the bottom level. The \emph{hierarchical-forecasting-with-alignment} approach is a simple yet effective variant of the bottom-up method, accounting for biases that are difficult to observe at the bottom level. It allows suboptimal forecasts at the lower level to retain a higher overall performance. The approach in this empirical study was developed by the first author during the M5 Forecasting Accuracy competition, where it ranked second. The method is also business-oriented and could benefit strategic business planning.
    A Two-Timescale Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic. (arXiv:2007.05170v3 [math.OC] UPDATED)
    (0 min) This paper analyzes a two-timescale stochastic algorithm framework for bilevel optimization. Bilevel optimization is a class of problems which exhibit a two-level structure, and its goal is to minimize an outer objective function with variables which are constrained to be the optimal solution to an (inner) optimization problem. We consider the case when the inner problem is unconstrained and strongly convex, while the outer problem is constrained and has a smooth objective function. We propose a two-timescale stochastic approximation (TTSA) algorithm for tackling such a bilevel problem. In the algorithm, a stochastic gradient update with a larger step size is used for the inner problem, while a projected stochastic gradient update with a smaller step size is used for the outer problem. We analyze the convergence rates for the TTSA algorithm under various settings: when the outer problem is strongly convex (resp.~weakly convex), the TTSA algorithm finds an $\mathcal{O}(K^{-2/3})$-optimal (resp.~$\mathcal{O}(K^{-2/5})$-stationary) solution, where $K$ is the total iteration number. As an application, we show that a two-timescale natural actor-critic proximal policy optimization algorithm can be viewed as a special case of our TTSA framework. Importantly, the natural actor-critic algorithm is shown to converge at a rate of $\mathcal{O}(K^{-1/4})$ in terms of the gap in expected discounted reward compared to a global optimal policy.
    Efficient Robust Training via Backward Smoothing. (arXiv:2010.01278v2 [cs.LG] UPDATED)
    (0 min) Adversarial training is so far the most effective strategy in defending against adversarial examples. However, it suffers from high computational costs due to the iterative adversarial attacks in each training step. Recent studies show that it is possible to achieve fast adversarial training by performing a single-step attack with random initialization. However, such an approach still lags behind state-of-the-art adversarial training algorithms on both stability and model robustness. In this work, we develop a new understanding of fast adversarial training, by viewing random initialization as performing randomized smoothing for better optimization of the inner maximization problem. Following this new perspective, we also propose a new initialization strategy, backward smoothing, to further improve the stability and model robustness over single-step robust training methods. Experiments on multiple benchmarks demonstrate that our method achieves similar model robustness as the original TRADES method while using much less training time ($\sim$3x improvement with the same training schedule).
    Refining Language Models with Compositional Explanations. (arXiv:2103.10415v3 [cs.CL] UPDATED)
    (0 min) Pre-trained language models have been successful on text classification tasks, but are prone to learning spurious correlations from biased datasets, and are thus vulnerable when making inferences in a new domain. Prior work reveals such spurious patterns via post-hoc explanation algorithms which compute the importance of input features. Further, the model is regularized to align the importance scores with human knowledge, so that the unintended model behaviors are eliminated. However, such a regularization technique lacks flexibility and coverage, since only importance scores towards a pre-defined list of features are adjusted, while more complex human knowledge such as feature interaction and pattern generalization can hardly be incorporated. In this work, we propose to refine a learned language model for a target domain by collecting human-provided compositional explanations regarding observed biases. By parsing these explanations into executable logic rules, the human-specified refinement advice from a small set of explanations can be generalized to more training examples. We additionally introduce a regularization term allowing adjustments for both importance and interaction of features to better rectify model behavior. We demonstrate the effectiveness of the proposed approach on two text classification tasks by showing improved performance in the target domain as well as improved model fairness after refinement.
    Separation of scales and a thermodynamic description of feature learning in some CNNs. (arXiv:2112.15383v1 [stat.ML])
    (0 min) Deep neural networks (DNNs) are powerful tools for compressing and distilling information. Due to their scale and complexity, often involving billions of inter-dependent internal degrees of freedom, exact analysis approaches often fall short. A common strategy in such cases is to identify slow degrees of freedom that average out the erratic behavior of the underlying fast microscopic variables. Here, we identify such a separation of scales occurring in over-parameterized deep convolutional neural networks (CNNs) at the end of training. It implies that neuron pre-activations fluctuate in a nearly Gaussian manner with a deterministic latent kernel. While for CNNs with infinitely many channels these kernels are inert, for finite CNNs they adapt and learn from data in an analytically tractable manner. The resulting thermodynamic theory of deep learning yields accurate predictions on several deep non-linear CNN toy models. In addition, it provides new ways of analyzing and understanding CNNs.
    An Unsupervised Domain Adaptation Model based on Dual-module Adversarial Training. (arXiv:2112.15555v1 [cs.LG])
    (0 min) In this paper, we propose a dual-module network architecture that employs a domain discriminative feature module to encourage the domain invariant feature module to learn more domain invariant features. The proposed architecture can be applied to any model that utilizes domain invariant features for unsupervised domain adaptation to improve its ability to extract domain invariant features. We conduct experiments with the Domain-Adversarial Training of Neural Networks (DANN) model as a representative algorithm. In the training process, we supply the same input to the two modules and then extract their feature distributions and prediction results respectively. We propose a discrepancy loss to measure the discrepancy of the prediction results and the feature distributions between the two modules. Through adversarial training that maximizes the discrepancy of their feature distributions while minimizing the discrepancy of their prediction results, the two modules are encouraged to learn more domain discriminative and domain invariant features respectively. Extensive comparative evaluations are conducted and the proposed approach outperforms the state-of-the-art in most unsupervised domain adaptation tasks.
    A Critical Review of Inductive Logic Programming Techniques for Explainable AI. (arXiv:2112.15319v1 [cs.LG])
    (0 min) Despite recent advances in modern machine learning algorithms, the opaqueness of their underlying mechanisms continues to be an obstacle in adoption. To instill confidence and trust in artificial intelligence systems, Explainable Artificial Intelligence has emerged as a response to improving modern machine learning algorithms' explainability. Inductive Logic Programming (ILP), a subfield of symbolic artificial intelligence, plays a promising role in generating interpretable explanations because of its intuitive logic-driven framework. ILP effectively leverages abductive reasoning to generate explainable first-order clausal theories from examples and background knowledge. However, several challenges in developing methods inspired by ILP need to be addressed for their successful application in practice. For example, existing ILP systems often have a vast solution space, and the induced solutions are very sensitive to noise and disturbances. This survey paper summarizes recent advances in ILP and discusses statistical relational learning and neural-symbolic algorithms, which offer synergistic views to ILP. Following a critical review of the recent advances, we delineate observed challenges and highlight potential avenues of further ILP-motivated research toward developing self-explanatory artificial intelligence systems.
    Simplifying Software Defect Prediction (via the "early bird" Heuristic). (arXiv:2105.11082v2 [cs.SE] UPDATED)
    (0 min) Before researchers rush to reason across all available data or try complex methods, perhaps it is prudent to first check for simpler alternatives. Specifically, if the historical data has the most information in some small region, then perhaps a model learned from that region would suffice for the rest of the project. To support this claim, we offer a case study with 240 GitHub projects, where we find that the information in those projects "clumped" towards the earliest parts of the project. A defect prediction model learned from just the first 150 commits works as well as, or better than, state-of-the-art alternatives. Using just this early life cycle data, we can build models very quickly, very early in the software project life cycle. Moreover, using this method, we have shown that a simple model (with just two features) generalizes to hundreds of software projects. Based on this experience, we suspect that prior work on generalizing software engineering defect prediction models may have needlessly complicated an inherently simple process. Further, prior work that focused on later-life cycle data needs to be revisited, since its conclusions were drawn from relatively uninformative regions. Replication note: all our data and scripts are online at https://github.com/snaraya7/simplifying-software-analytics
    Hybrid Adversarial Imitation Learning. (arXiv:2102.02454v10 [cs.LG] UPDATED)
    (0 min) Extrapolating beyond-demonstrator (BD) performance through the imitation learning (IL) algorithm aims to learn from and outperform the demonstrator. Most existing BDIL algorithms are performed in two stages by first inferring a reward function before learning a policy via reinforcement learning (RL). However, such two-stage BDIL algorithms suffer from high computational complexity, weak robustness, and large performance variations. In particular, a poor reward function derived in the first stage will inevitably incur severe performance loss in the second stage. In this work, we propose a hybrid adversarial imitation learning (HAIL) algorithm that is one-stage, model-free, curiosity-driven, and trained in a generative-adversarial (GA) fashion. Thanks to the one-stage design, HAIL integrates both the reward function learning and the policy optimization into one procedure, which leads to many advantages such as low computational complexity, high robustness, and strong adaptability. More specifically, HAIL simultaneously imitates the demonstrator and explores BD performance by utilizing hybrid rewards. Extensive simulation results confirm that HAIL can achieve higher performance as compared to other similar BDIL algorithms.
    Fast Learning of MNL Model from General Partial Rankings with Application to Network Formation Modeling. (arXiv:2112.15575v1 [cs.LG])
    (0 min) Multinomial Logit (MNL) is one of the most popular discrete choice models and has been widely used to model ranking data. However, there is a long-standing technical challenge of learning MNL from many real-world ranking data: exact calculation of the MNL likelihood of \emph{partial rankings} is generally intractable. In this work, we develop a scalable method for approximating the MNL likelihood of general partial rankings in polynomial time complexity. We also extend the proposed method to learn mixtures of MNLs. We demonstrate that the proposed methods are particularly helpful for applications to choice-based network formation modeling, where the formation of new edges in a network is viewed as individuals making choices of their friends over a candidate set. The problem of learning mixtures of MNL models from partial rankings arises naturally in such applications, and the proposed methods can be used to learn MNL models from network data without the strong assumption that the temporal order of all the edge formations is available. We conduct experiments on both synthetic and real-world network data to demonstrate that the proposed methods achieve more accurate parameter estimation and a better fit of the data compared to conventional methods.
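    For context, the exact MNL (Plackett-Luce) log-likelihood of a *full* ranking is easy to compute; it is general partial rankings, where the likelihood becomes a sum over all consistent full rankings, that the paper approximates. A small sketch of the tractable full-ranking case, with illustrative utilities:

        import numpy as np

        def mnl_log_likelihood(utilities, ranking):
            """log P(ranking) = sum_k [ u_{r_k} - log sum_{j>=k} exp(u_{r_j}) ]."""
            u = utilities[ranking]                        # utilities in ranked order
            ll = 0.0
            for k in range(len(u) - 1):
                ll += u[k] - np.log(np.exp(u[k:]).sum())  # choose r_k from the rest
            return ll

        u = np.array([1.0, 0.2, -0.5, 0.1])               # item utilities
        print(mnl_log_likelihood(u, np.array([0, 3, 1, 2])))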
    Efficient and Reliable Overlay Networks for Decentralized Federated Learning. (arXiv:2112.15486v1 [cs.NI])
    (0 min) We propose near-optimal overlay networks based on $d$-regular expander graphs to accelerate decentralized federated learning (DFL) and improve its generalization. In DFL a massive number of clients are connected by an overlay network, and they solve machine learning problems collaboratively without sharing raw data. Our overlay network design integrates spectral graph theory and the theoretical convergence and generalization bounds for DFL. As such, our proposed overlay networks accelerate convergence, improve generalization, and enhance robustness to client failures in DFL with theoretical guarantees. Also, we present an efficient algorithm to convert a given graph to a practical overlay network and to maintain the network topology after potential client failures. We numerically verify the advantages of DFL with our proposed networks on various benchmark tasks, ranging from image classification to language modeling using hundreds of clients.
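    A sketch of the kind of overlay this suggests, with networkx's random regular graph standing in for the paper's construction; the spectral gap computed below is the quantity that governs mixing speed in such expander-based designs (sizes are illustrative).

        import networkx as nx
        import numpy as np

        n, d = 100, 6                                # clients and degree
        G = nx.random_regular_graph(d, n, seed=0)

        A = nx.to_numpy_array(G)
        eig = np.sort(np.abs(np.linalg.eigvalsh(A)))[::-1]   # eig[0] == d
        print("spectral gap:", d - eig[1])           # large gap => fast consensus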
    ViNMT: Neural Machine Translation Toolkit. (arXiv:2112.15272v1 [cs.CL])
    (0 min) We present an open-source toolkit for neural machine translation (NMT). The new toolkit is mainly based on the vaunted Transformer (Vaswani et al., 2017) along with many other improvements detailed below, in order to create a self-contained, simple to use, consistent and comprehensive framework for Machine Translation tasks of various domains. It is tooled to support both bilingual and multilingual translation tasks, starting from building the model from respective corpora, to inferring new predictions or packaging the model into a serving-capable JIT format.
    AutoFITS: Automatic Feature Engineering for Irregular Time Series. (arXiv:2112.14806v1 [cs.LG])
    (0 min) A time series represents a set of observations collected over time. Typically, these observations are captured with a uniform sampling frequency (e.g. daily). When data points are observed at uneven time intervals the time series is referred to as irregular or intermittent. In such scenarios, the most common solution is to reconstruct the time series to make it regular, thus removing its intermittency. We hypothesise that, in irregular time series, the time at which each observation is collected may be helpful to summarise the dynamics of the data and improve forecasting performance. We study this idea by developing a novel automatic feature engineering framework, which focuses on extracting information from this point of view, i.e., when each instance is collected. We study how valuable this information is by integrating it in a time series forecasting workflow and investigate how it compares to or complements state-of-the-art methods for regular time series forecasting. In the end, we contribute by providing a novel framework that tackles feature engineering for time series from an angle previously largely ignored. We show that our approach has the potential to extract additional information about time series that significantly improves forecasting performance.
    Investigating Pose Representations and Motion Contexts Modeling for 3D Motion Prediction. (arXiv:2112.15012v1 [cs.CV])
    (0 min) Predicting human motion from a historical pose sequence is crucial for a machine to succeed in intelligent interactions with humans. One aspect that has been overlooked so far is that how we represent the skeletal pose has a critical impact on the prediction results. Yet there is no prior effort that systematically investigates different pose representation schemes. We conduct an in-depth study on various pose representations with a focus on their effects on the motion prediction task. Moreover, recent approaches build upon off-the-shelf RNN units for motion prediction. These approaches process input pose sequences sequentially and inherently have difficulties in capturing long-term dependencies. In this paper, we propose a novel RNN architecture termed AHMR (Attentive Hierarchical Motion Recurrent network) for motion prediction which simultaneously models local motion contexts and a global context. We further explore a geodesic loss and a forward kinematics loss for the motion prediction task, which have more geometric significance than the widely employed L2 loss. Interestingly, we applied our method to a range of articulated objects, including humans, fish, and mice. Empirical results show that our approach outperforms the state-of-the-art methods in short-term prediction and achieves much enhanced long-term prediction proficiency, such as retaining natural human-like motions over 50-second predictions. Our codes are released.
    A General Traffic Shaping Protocol in E-Commerce. (arXiv:2112.14941v1 [cs.LG])
    (0 min) To approach different business objectives, online traffic shaping algorithms aim at improving exposures of a target set of items, such as boosting the growth of new commodities. Generally, these algorithms assume that the utility of each user-item pair can be accessed via a well-trained conversion rate prediction model. However, for real E-Commerce platforms, there are unavoidable factors preventing us from learning such an accurate model. In order to break the heavy dependence on accurate inputs of the utility, we propose a general online traffic shaping protocol for online E-Commerce applications. In our framework, we approximate the function mapping the bonus scores, which are generally the only means of influencing the ranking result in the traffic shaping problem, to the numbers of exposures and purchases. Concretely, we approximate the above function by a class of piece-wise linear functions constructed on the convex hull of the explored data points. Moreover, we reformulate the online traffic shaping problem as a linear program where these piece-wise linear functions are embedded into both the objective and the constraints. Our algorithm can straightforwardly optimize the linear program in the primal space, and its solution can be simply applied by a stochastic strategy to fulfill the optimized objective and the constraints in expectation. Finally, the online A/B test shows our proposed algorithm steadily outperforms the previous industrial-level traffic shaping algorithm.
    BP-Net: Cuff-less, Calibration-free, and Non-invasive Blood Pressure Estimation via a Generic Deep Convolutional Architecture. (arXiv:2112.15271v1 [cs.LG])
    (0 min) Objective: The paper focuses on development of robust and accurate processing solutions for continuous and cuff-less blood pressure (BP) monitoring. In this regard, a robust deep learning-based framework is proposed for computation of low latency, continuous, and calibration-free upper and lower bounds on the systolic and diastolic BP. Method: Referred to as the BP-Net, the proposed framework is a novel convolutional architecture that provides longer effective memory while achieving superior performance due to incorporation of causal dilated convolutions and residual connections. To utilize the real potential of deep learning in extraction of intrinsic features (deep features) and enhance the long-term robustness, the BP-Net uses raw electrocardiogram (ECG) and photoplethysmogram (PPG) signals without extraction of any form of hand-crafted features, as is common in existing solutions. Results: By capitalizing on the fact that datasets used in recent literature are not unified and properly defined, a benchmark dataset is constructed from the MIMIC-I and MIMIC-III databases obtained from PhysioNet. The proposed BP-Net is evaluated based on this benchmark dataset demonstrating promising performance and showing superior generalizable capacity. Conclusion: The proposed BP-Net architecture is more accurate than canonical recurrent networks and enhances the long-term robustness of the BP estimation task. Significance: The proposed BP-Net architecture addresses key drawbacks of existing BP estimation solutions, i.e., relying heavily on extraction of hand-crafted features, such as pulse arrival time (PAT), and lack of robustness. Finally, the constructed BP-Net dataset provides a unified base for evaluation and comparison of deep learning-based BP estimation algorithms.
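    A minimal PyTorch sketch of the causal dilated convolution building block the abstract refers to: left-padding by (kernel_size - 1) * dilation ensures the output at time t never sees future samples. Channel sizes and the two-channel ECG+PPG input are illustrative.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class CausalDilatedConv1d(nn.Module):
            def __init__(self, in_ch, out_ch, kernel_size, dilation):
                super().__init__()
                self.pad = (kernel_size - 1) * dilation
                self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

            def forward(self, x):                  # x: (batch, channels, time)
                x = F.pad(x, (self.pad, 0))        # pad the past only
                return self.conv(x)

        block = CausalDilatedConv1d(2, 32, kernel_size=3, dilation=4)
        y = block(torch.randn(1, 2, 1000))         # e.g. stacked ECG + PPG
        print(y.shape)                             # torch.Size([1, 32, 1000])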
    Calibrated Hyperspectral Image Reconstruction via Graph-based Self-Tuning Network. (arXiv:2112.15362v1 [eess.IV])
    (0 min) Recently, hyperspectral imaging (HSI) has attracted increasing research attention, especially for the ones based on a coded aperture snapshot spectral imaging (CASSI) system. Existing deep HSI reconstruction models are generally trained on paired data to retrieve original signals upon 2D compressed measurements given by a particular optical hardware mask in CASSI, during which the mask largely impacts the reconstruction performance and could work as a "model hyperparameter" governing data augmentation. This mask-specific training style will lead to a hardware miscalibration issue, which sets up barriers to deploying deep HSI models among different hardware and noisy environments. To address this challenge, we introduce mask uncertainty for HSI with a complete variational Bayesian learning treatment and explicitly model it through a mask decomposition inspired by real hardware. Specifically, we propose a novel Graph-based Self-Tuning (GST) network to reason about uncertainties adapting to varying spatial structures of masks among different hardware. Moreover, we develop a bilevel optimization framework to balance HSI reconstruction and uncertainty estimation, accounting for the hyperparameter property of masks. Extensive experimental results and model discussions validate the effectiveness (over 33/30 dB) of the proposed GST method under two miscalibration scenarios and demonstrate a highly competitive performance compared with the state-of-the-art well-calibrated methods. Our code and pre-trained model are available at https://github.com/Jiamian-Wang/mask_uncertainty_spectral_SCI
    Multi-Agent Reinforcement Learning via Adaptive Kalman Temporal Difference and Successor Representation. (arXiv:2112.15156v1 [cs.LG])
    (0 min) Distributed Multi-Agent Reinforcement Learning (MARL) algorithms have attracted a surge of interest lately, mainly due to the recent advancements of Deep Neural Networks (DNNs). Conventional Model-Based (MB) or Model-Free (MF) RL algorithms are not directly applicable to MARL problems due to their utilization of a fixed reward model for learning the underlying value function. While DNN-based solutions perform remarkably well when a single agent is involved, such methods fail to fully generalize to the complexities of MARL problems. In other words, although recently developed approaches based on DNNs for multi-agent environments have achieved superior performance, they are still prone to overfitting, high sensitivity to parameter selection, and sample inefficiency. The paper proposes the Multi-Agent Adaptive Kalman Temporal Difference (MAK-TD) framework and its Successor Representation-based variant, referred to as the MAK-SR. Intuitively speaking, the main objective is to capitalize on unique characteristics of Kalman Filtering (KF) such as uncertainty modeling and online second-order learning. The proposed MAK-TD/SR frameworks consider the continuous nature of the action-space that is associated with high dimensional multi-agent environments and exploit Kalman Temporal Difference (KTD) to address the parameter uncertainty. By leveraging the KTD framework, the SR learning procedure is modeled as a filtering problem, where Radial Basis Function (RBF) estimators are used to encode the continuous space into feature vectors. On the other hand, for learning localized reward functions, we resort to Multiple Model Adaptive Estimation (MMAE) to deal with the lack of prior knowledge on the observation noise covariance and the observation mapping function. The proposed MAK-TD/SR frameworks are evaluated via several experiments, which are implemented through the OpenAI Gym MARL benchmarks.
    Robustness and risk management via distributional dynamic programming. (arXiv:2112.15430v1 [cs.LG])
    (0 min) In dynamic programming (DP) and reinforcement learning (RL), an agent learns to act optimally in terms of expected long-term return by sequentially interacting with its environment modeled by a Markov decision process (MDP). More generally in distributional reinforcement learning (DRL), the focus is on the whole distribution of the return, not just its expectation. Although DRL-based methods produced state-of-the-art performance in RL with function approximation, they involve additional quantities (compared to the non-distributional setting) that are still not well understood. As a first contribution, we introduce a new class of distributional operators, together with a practical DP algorithm for policy evaluation, that come with a robust MDP interpretation. Indeed, our approach reformulates the problem through an augmented state space in which each state is split into a worst-case substate and a best-case substate, whose values are maximized by safe and risky policies respectively. Finally, we derive distributional operators and DP algorithms solving a new control task: How to distinguish safe from risky optimal actions in order to break ties in the space of optimal policies?
    Self Reward Design with Fine-grained Interpretability. (arXiv:2112.15034v1 [cs.LG])
    (0 min) Transparency and fairness issues in Deep Reinforcement Learning may stem from the black-box nature of the deep neural networks used to learn its policy, value functions etc. This paper proposes a way to circumvent the issues through the bottom-up design of neural networks (NN) with detailed interpretability, where each neuron or layer has its own meaning and utility that corresponds to a humanly understandable concept. With deliberate design, we show that lavaland problems can be solved using an NN model with few parameters. Furthermore, we introduce the Self Reward Design (SRD), inspired by the Inverse Reward Design, so that our interpretable design can (1) solve the problem by pure design (although imperfectly), (2) be optimized via SRD, and (3) perform avoidance of unknown states by recognizing the inactivations of neurons aggregated as the activation in \(w_{unknown}\).
    Distributed Random Reshuffling over Networks. (arXiv:2112.15287v1 [math.OC])
    (0 min) In this paper, we consider the distributed optimization problem where $n$ agents, each possessing a local cost function, collaboratively minimize the average of the local cost functions over a connected network. To solve the problem, we propose a distributed random reshuffling (D-RR) algorithm that combines the classical distributed gradient descent (DGD) method and Random Reshuffling (RR). We show that D-RR inherits the superiority of RR for both smooth strongly convex and smooth nonconvex objective functions. In particular, for smooth strongly convex objective functions, D-RR achieves $\mathcal{O}(1/T^2)$ rate of convergence (here, $T$ counts the total number of iterations) in terms of the squared distance between the iterate and the unique minimizer. When the objective function is assumed to be smooth nonconvex and has Lipschitz continuous component functions, we show that D-RR drives the squared norm of gradient to $0$ at a rate of $\mathcal{O}(1/T^{2/3})$. These convergence results match those of centralized RR (up to constant factors).
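    A schematic sketch of D-RR's structure: each agent runs an epoch over a fresh permutation of its local samples, taking a gradient step and then averaging with its neighbors through a doubly stochastic mixing matrix. The quadratic losses, step size, and ring topology are illustrative.

        import numpy as np

        rng = np.random.default_rng(0)
        n_agents, m, dim, lr = 4, 10, 3, 0.05
        A = rng.standard_normal((n_agents, m, dim))   # agent i's local data
        b = rng.standard_normal((n_agents, m))
        X = np.zeros((n_agents, dim))                 # one iterate per agent

        W = np.eye(n_agents) * 0.5                    # doubly stochastic ring mixing
        for i in range(n_agents):
            W[i, (i - 1) % n_agents] = W[i, (i + 1) % n_agents] = 0.25

        for epoch in range(200):
            perms = [rng.permutation(m) for _ in range(n_agents)]   # reshuffle
            for t in range(m):
                G = np.stack([
                    (A[i, perms[i][t]] @ X[i] - b[i, perms[i][t]]) * A[i, perms[i][t]]
                    for i in range(n_agents)])        # component gradients
                X = W @ (X - lr * G)                  # local step + gossip averaging
        print("consensus gap:", np.linalg.norm(X - X.mean(0)))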
    Scalable Deep Graph Clustering with Random-walk based Self-supervised Learning. (arXiv:2112.15530v1 [cs.LG])
    (0 min) Web-based interactions can be frequently represented by an attributed graph, and node clustering in such graphs has received much attention lately. Multiple efforts have successfully applied Graph Convolutional Networks (GCN), though with some limits on accuracy as GCNs have been shown to suffer from over-smoothing issues. Though other methods (particularly those based on Laplacian Smoothing) have reported better accuracy, a fundamental limitation of all the work is a lack of scalability. This paper addresses this open problem by relating the Laplacian smoothing to the Generalized PageRank and applying a random-walk based algorithm as a scalable graph filter. This forms the basis for our scalable deep clustering algorithm, RwSL, where through a self-supervised mini-batch training mechanism, we simultaneously optimize a deep neural network for sample-cluster assignment distribution and an autoencoder for a clustering-oriented embedding. Using 6 real-world datasets and 6 clustering metrics, we show that RwSL achieved improved results over several recent baselines. Most notably, we show that RwSL, unlike all other deep clustering frameworks, can continue to scale beyond graphs with more than one million nodes, i.e., handle web-scale. We also demonstrate how RwSL could perform node clustering on a graph with 1.8 billion edges using only a single GPU.
    Shift-Equivariant Similarity-Preserving Hypervector Representations of Sequences. (arXiv:2112.15475v1 [cs.AI])
    (0 min) Hyperdimensional Computing (HDC), also known as Vector-Symbolic Architectures (VSA), is a promising framework for the development of cognitive architectures and artificial intelligence systems, as well as for technical applications and emerging neuromorphic and nanoscale hardware. HDC/VSA operate with hypervectors, i.e., distributed vector representations of large fixed dimension (usually > 1000). One of the key ingredients of HDC/VSA are the methods for encoding data of various types (from numeric scalars and vectors to graphs) into hypervectors. In this paper, we propose an approach for the formation of hypervectors of sequences that provides both an equivariance with respect to the shift of sequences and preserves the similarity of sequences with identical elements at nearby positions. Our methods represent the sequence elements by compositional hypervectors and exploit permutations of hypervectors for representing the order of sequence elements. We experimentally explored the proposed representations using a diverse set of tasks with data in the form of symbolic strings. Although our approach is feature-free as it forms the hypervector of a sequence from the hypervectors of its symbols at their positions, it demonstrated the performance on a par with the methods that apply various features, such as subsequences. The proposed techniques were designed for the HDC/VSA model known as Sparse Binary Distributed Representations. However, they can be adapted to hypervectors in formats of other HDC/VSA models, as well as for representing sequences of types other than symbolic strings.
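    A sketch of the permutation-based encoding this builds on, using NumPy bipolar hypervectors and cyclic shift as the permutation: a symbol at position p contributes its hypervector shifted p steps, and the sequence code is the sign of the superposition. The dimension and alphabet are illustrative, and the paper's Sparse Binary Distributed Representations differ in format.

        import numpy as np

        rng = np.random.default_rng(0)
        D = 10000
        alphabet = {c: rng.choice([-1, 1], size=D) for c in "abcdef"}

        def encode(seq, offset=0):
            return np.sign(sum(np.roll(alphabet[c], p + offset)
                               for p, c in enumerate(seq)))

        def sim(u, v):
            return (u @ v) / np.sqrt((u @ u) * (v @ v))

        # Shift-equivariance: encoding a shifted sequence equals the cyclic
        # shift of the original encoding.
        print(sim(encode("abcdef", offset=1), np.roll(encode("abcdef"), 1)))  # 1.0
        # Similarity preservation: sequences sharing symbols at nearby
        # positions stay close in hypervector space.
        print(sim(encode("abcdef"), encode("abcdee")))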
    Modelling of Bi-directional Spatio-Temporal Dependence and Users' Dynamic Preferences for Missing POI Check-in Identification. (arXiv:2112.15285v1 [cs.LG])
    (0 min) Human mobility data accumulated from Point-of-Interest (POI) check-ins provides a great opportunity for user behavior understanding. However, data quality issues (e.g., geolocation information missing, unreal check-ins, data sparsity) in real-life mobility data limit the effectiveness of existing POI-oriented studies, e.g., POI recommendation and location prediction, when applied to real applications. To this end, in this paper, we develop a model, named Bi-STDDP, which can integrate bi-directional spatio-temporal dependence and users' dynamic preferences, to identify the missing POI check-in, i.e., the POI a user visited at a specific time. Specifically, we first utilize bi-directional global spatial and local temporal information of POIs to capture the complex dependence relationships. Then, the target temporal pattern in combination with user and POI information is fed into a multi-layer network to capture users' dynamic preferences. Moreover, the dynamic preferences are transformed into the same space as the dependence relationships to form the final model. Finally, the proposed model is evaluated on three large-scale real-world datasets and the results demonstrate significant improvements of our model compared with state-of-the-art methods. Also, it is worth noting that the proposed model can be naturally extended to address POI recommendation and location prediction tasks with competitive performance.
    Exploring the pattern of Emotion in children with ASD as an early biomarker through Recurring-Convolution Neural Network (R-CNN). (arXiv:2112.14983v1 [cs.CV])
    (0 min) Autism Spectrum Disorder (ASD) is a major concern among occupational therapists. The foremost challenge of this neurodevelopmental disorder lies in analyzing and exploring the various symptoms of children at their early stage of development. Such early identification could help therapists and clinicians provide proper assistive support so that the children can lead an independent life. Facial expressions and the emotions perceived by children could contribute to such early intervention for autism. In this regard, this paper identifies basic facial expressions and explores the corresponding emotions over a time-variant factor. The emotions are analyzed by combining facial expressions identified through a CNN, using 68 landmark points plotted on the frontal face, with an RNN-based prediction network, together forming the RCNN-FER system. The paper adopts R-CNN to take advantage of its increased accuracy and performance with decreased time complexity in predicting emotion as a textual network analysis. The paper demonstrates better accuracy in identifying emotions in autistic children when compared with simple machine learning models built for such identification, contributing to the autistic community.
    Derivative-Free Policy Optimization for Linear Risk-Sensitive and Robust Control Design: Implicit Regularization and Sample Complexity. (arXiv:2101.01041v3 [math.OC] UPDATED)
    (0 min) Direct policy search serves as one of the workhorses in modern reinforcement learning (RL), and its applications in continuous control tasks have recently attracted increasing attention. In this work, we investigate the convergence theory of policy gradient (PG) methods for learning the linear risk-sensitive and robust controller. In particular, we develop PG methods that can be implemented in a derivative-free fashion by sampling system trajectories, and establish both global convergence and sample complexity results for two fundamental settings in risk-sensitive and robust control: the finite-horizon linear exponential quadratic Gaussian, and the finite-horizon linear-quadratic disturbance attenuation problems. As a by-product, our results also provide the first sample complexity for the global convergence of PG methods on solving zero-sum linear-quadratic dynamic games, a nonconvex-nonconcave minimax optimization problem that serves as a baseline setting in multi-agent reinforcement learning (MARL) with continuous spaces. One feature of our algorithms is that during the learning phase, a certain level of robustness/risk-sensitivity of the controller is preserved, which we term the implicit regularization property, and which is an essential requirement in safety-critical control systems.
    ChunkFormer: Learning Long Time Series with Multi-stage Chunked Transformer. (arXiv:2112.15087v1 [cs.LG])
    (0 min) The analysis of long sequence data remains challenging in many real-world applications. We propose a novel architecture, ChunkFormer, that improves the existing Transformer framework to handle these challenges when dealing with long time series. Original Transformer-based models adopt an attention mechanism to discover global information along a sequence to leverage the contextual data. In long sequential data, however, local information such as seasonality and fluctuations is confined to short spans of the sequence. In addition, the original Transformer consumes more resources by carrying the entire attention matrix during training. To overcome these challenges, ChunkFormer splits the long sequences into smaller sequence chunks for the attention calculation, progressively applying different chunk sizes in each stage. In this way, the proposed model gradually learns both local and global information without changing the total length of the input sequences. We have extensively tested the effectiveness of this new architecture on different business domains and have demonstrated the advantage of such a model over the existing Transformer-based models.
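    The core chunking idea is compact enough to sketch: split the sequence into fixed-size chunks and run full self-attention only within each chunk (single head, no learned projections here; ChunkFormer additionally varies the chunk size across stages).

        import torch
        import torch.nn.functional as F

        def chunked_attention(x, chunk_size):
            B, L, D = x.shape                     # requires L % chunk_size == 0
            xc = x.view(B, L // chunk_size, chunk_size, D)
            scores = xc @ xc.transpose(-2, -1) / D**0.5   # attention within chunks
            out = F.softmax(scores, dim=-1) @ xc
            return out.view(B, L, D)              # cost O(L * chunk) vs O(L^2)

        y = chunked_attention(torch.randn(2, 1024, 64), chunk_size=128)
        print(y.shape)                            # torch.Size([2, 1024, 64])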
    Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework. (arXiv:2002.01711v5 [cs.LG] UPDATED)
    (0 min) A/B testing, or online experimentation, is a standard business strategy to compare a new product with an old one in pharmaceutical, technological, and traditional industries. Major challenges arise in online experiments on two-sided marketplace platforms (e.g., Uber) where there is only one unit that receives a sequence of treatments over time. In those experiments, the treatment at a given time impacts the current outcome as well as future outcomes. The aim of this paper is to introduce a reinforcement learning framework for carrying out A/B testing in these experiments, while characterizing the long-term treatment effects. Our proposed testing procedure allows for sequential monitoring and online updating. It is generally applicable to a variety of treatment designs in different industries. In addition, we systematically investigate the theoretical properties (e.g., size and power) of our testing procedure. Finally, we apply our framework to both simulated data and a real-world data example obtained from a technological company to illustrate its advantage over the current practice. A Python implementation of our test is available at https://github.com/callmespring/CausalRL.
    Adversarial Learning for Incentive Optimization in Mobile Payment Marketing. (arXiv:2112.15434v1 [cs.LG])
    (0 min) Many payment platforms hold large-scale marketing campaigns, which allocate incentives to encourage users to pay through their applications. To maximize the return on investment, incentive allocations are commonly solved in a two-stage procedure. After training a response estimation model to estimate the users' mobile payment probabilities (MPP), a linear programming process is applied to obtain the optimal incentive allocation. However, the large amount of biased data in the training set, generated by the previous biased allocation policy, causes a biased estimation. This bias deteriorates the performance of the response model and misleads the linear programming process, dramatically degrading the performance of the resulting allocation policy. To overcome this obstacle, we propose a bias correction adversarial network. Our method leverages the small set of unbiased data obtained under a full-randomized allocation policy to train an unbiased model and then uses it to reduce the bias with adversarial learning. Offline and online experimental results demonstrate that our method outperforms state-of-the-art approaches and significantly improves the performance of the resulting allocation policy in a real-world marketing campaign.
    Studying the Interplay between Information Loss and Operation Loss in Representations for Classification. (arXiv:2112.15238v1 [cs.LG])
    (0 min) Information-theoretic measures have been widely adopted in the design of features for learning and decision problems. Inspired by this, we look at the relationship between i) a weak form of information loss in the Shannon sense and ii) the operation loss in the minimum probability of error (MPE) sense when considering a family of lossy continuous representations (features) of a continuous observation. We present several results that shed light on this interplay. Our first result offers a lower bound on a weak form of information loss as a function of its respective operation loss when adopting a discrete lossy representation (quantization) instead of the original raw observation. From this, our main result shows that a specific form of vanishing information loss (a weak notion of asymptotic informational sufficiency) implies a vanishing MPE loss (or asymptotic operational sufficiency) when considering a general family of lossy continuous representations. Our theoretical findings support the observation that the selection of feature representations that attempt to capture informational sufficiency is appropriate for learning, but this selection is a rather conservative design principle if the intended goal is achieving MPE in classification. Supporting this last point, and under some structural conditions, we show that it is possible to adopt an alternative notion of informational sufficiency (strictly weaker than pure sufficiency in the mutual information sense) to achieve operational sufficiency in learning.
    PINNs for the Solution of the Hyperbolic Buckley-Leverett Problem with a Non-convex Flux Function. (arXiv:2112.14826v1 [physics.flu-dyn])
    (0 min) The displacement of two immiscible fluids is a common problem in fluid flow in porous media. Such a problem can be posed as a partial differential equation (PDE) in what is commonly referred to as a Buckley-Leverett (B-L) problem. The B-L problem is a non-linear hyperbolic conservation law that is known to be notoriously difficult to solve using traditional numerical methods. Here, we address the forward hyperbolic B-L problem with a nonconvex flux function using physics-informed neural networks (PINNs). The contributions of this paper are twofold. First, we present a PINN approach to solve the hyperbolic B-L problem by embedding the Oleinik entropy condition into the neural network residual. We do not use a diffusion term (artificial viscosity) in the residual loss, but we rely on the strong form of the PDE. Second, we use the Adam optimizer with a residual-based adaptive refinement (RAR) algorithm to achieve an ultra-low loss without weighting. Our solution method can accurately capture the shock front and produce an accurate overall solution. We report an $L_2$ validation error of $2 \times 10^{-2}$ and an $L_2$ loss of $1 \times 10^{-6}$. The proposed method does not require any additional regularization or weighting of losses to obtain such an accurate solution.
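    A sketch of the strong-form residual such a PINN minimizes, for u_t + f(u)_x = 0 with the non-convex fractional-flow flux f(u) = u^2 / (u^2 + M (1 - u)^2); the network, the mobility ratio M, and the collocation points are illustrative, and the Oleinik entropy condition used in the paper is omitted here.

        import torch

        net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                                  torch.nn.Linear(64, 1))

        def bl_residual(xt, M=2.0):
            """Residual of u_t + f(u)_x at collocation points xt = (x, t)."""
            xt = xt.requires_grad_(True)
            u = net(xt)
            du = torch.autograd.grad(u, xt, torch.ones_like(u), create_graph=True)[0]
            u_t = du[:, 1:]                            # column 1 is t
            f = u**2 / (u**2 + M * (1 - u) ** 2)       # non-convex flux
            df = torch.autograd.grad(f, xt, torch.ones_like(f), create_graph=True)[0]
            f_x = df[:, :1]                            # column 0 is x
            return u_t + f_x

        xt = torch.rand(256, 2)                        # random (x, t) in [0, 1]^2
        loss = (bl_residual(xt) ** 2).mean()           # residual term of the PINN loss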
    SplitBrain: Hybrid Data and Model Parallel Deep Learning. (arXiv:2112.15317v1 [cs.LG])
    (0 min) The recent success of deep learning applications has coincided with the wide availability of powerful computational resources for training sophisticated machine learning models on huge datasets. Nonetheless, training large models such as convolutional neural networks using model parallelism (as opposed to data parallelism) is challenging because the complex nature of communication between model shards makes it difficult to partition the computation efficiently across multiple machines with an acceptable trade-off. This paper presents SplitBrain, a high-performance distributed deep learning framework supporting hybrid data and model parallelism. Specifically, SplitBrain provides layer-specific partitioning that co-locates compute-intensive convolutional layers while sharding memory-demanding layers. A novel scalable group communication scheme is proposed to further improve the training throughput with reduced communication overhead. The results show that SplitBrain can achieve nearly linear speedup while saving up to 67% of memory consumption for data- and model-parallel VGG on CIFAR-10.
    Deconfounded Training for Graph Neural Networks. (arXiv:2112.15089v1 [cs.LG])
    (0 min) Learning powerful representations is one central theme of graph neural networks (GNNs). It requires refining the critical information from the input graph, instead of the trivial patterns, to enrich the representations. Towards this end, graph attention and pooling methods prevail. They mostly follow the paradigm of "learning to attend", which maximizes the mutual information between the attended subgraph and the ground-truth label. However, this training paradigm is prone to capture the spurious correlations between the trivial subgraph and the label. Such spurious correlations are beneficial to in-distribution (ID) test evaluations, but cause poor generalization on out-of-distribution (OOD) test data. In this work, we revisit GNN modeling from the causal perspective. On top of our causal assumption, the trivial information serves as a confounder between the critical information and the label, which opens a backdoor path between them and makes them spuriously correlated. Hence, we present a new paradigm of deconfounded training (DTP) that better mitigates the confounding effect and latches on the critical information, to enhance the representation and generalization ability. Specifically, we adopt attention modules to disentangle the critical subgraph and the trivial subgraph. Then we make each critical subgraph fairly interact with diverse trivial subgraphs to achieve a stable prediction. This allows GNNs to capture a more reliable subgraph whose relation with the label is robust across different distributions. We conduct extensive experiments on synthetic and real-world datasets to demonstrate the effectiveness of our approach.
    Music-to-Dance Generation with Optimal Transport. (arXiv:2112.01806v1 [cs.SD] CROSS LISTED)
    (0 min) Dance choreography for a piece of music is a challenging task, having to be creative in presenting distinctive stylistic dance elements while taking into account the musical theme and rhythm. It has been tackled by different approaches such as similarity retrieval, sequence-to-sequence modeling, and generative adversarial networks, but the generated dance sequences often fall short in motion realism, diversity, and music consistency. In this paper, we propose a Music-to-Dance with Optimal Transport Network (MDOT-Net) for learning to generate 3D dance choreographies from music. We introduce an optimal transport distance for evaluating the authenticity of the generated dance distribution and a Gromov-Wasserstein distance to measure the correspondence between the dance distribution and the input music. This gives a well-defined and non-divergent training objective that mitigates the limitation of standard GAN training, which is frequently plagued with instability and divergent generator loss issues. Extensive experiments demonstrate that our MDOT-Net can synthesize realistic and diverse dances which achieve an organic unity with the input music, reflecting the shared intentionality and matching the rhythmic articulation.
    Machine Learning Application Development: Practitioners' Insights. (arXiv:2112.15277v1 [cs.SE])
    (0 min) Nowadays, intelligent systems and services are getting increasingly popular as they provide data-driven solutions to diverse real-world problems, thanks to recent breakthroughs in Artificial Intelligence (AI) and Machine Learning (ML). However, machine learning meets software engineering not only with promising potential but also with some inherent challenges. Despite some recent research efforts, we still do not have a clear understanding of the challenges of developing ML-based applications and of current industry practices. Moreover, it is unclear where software engineering researchers should focus their efforts to better support ML application developers. In this paper, we report on a survey that aimed to understand the challenges and best practices of ML application development. We synthesize the results obtained from 80 practitioners (with diverse skills, experience, and application domains) into 17 findings, outlining challenges and best practices for ML application development. Practitioners involved in the development of ML-based software systems can leverage the summarized best practices to improve the quality of their systems. We hope that the reported challenges will inform the research community about topics that need to be investigated to improve the engineering process and the quality of ML-based applications.
    A Unified DRO View of Multi-class Loss Functions with top-N Consistency. (arXiv:2112.14869v1 [cs.LG])
    (0 min) Multi-class classification is one of the most common tasks in machine learning applications, where data is labeled by one of many class labels. Many loss functions have been proposed for multi-class classification, including two well-known ones, namely the cross-entropy (CE) loss and the Crammer-Singer (CS) loss (aka the SVM loss). While the CS loss has been used widely for traditional machine learning tasks, the CE loss is usually the default choice for multi-class deep learning tasks. There are also top-$k$ variants of the CS and CE losses that are proposed to promote the learning of a classifier for achieving better top-$k$ accuracy. Nevertheless, the relationship between these different losses remains unclear, which hinders our understanding of how they can be expected to behave in different scenarios. In this paper, we present a unified view of the CS/CE losses and their smoothed top-$k$ variants by proposing a new family of loss functions, which are arguably better than the CS/CE losses when the given label information is incomplete and noisy. The new family of smooth loss functions, named the label-distributionally robust (LDR) loss, is defined by leveraging the distributionally robust optimization (DRO) framework to model the uncertainty in the given label information, where the uncertainty over true class labels is captured by using distributional weights for each label regularized by a function.
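    As a hedged reading of the DRO construction above (not the paper's exact definition): if the uncertainty over labels is a distribution on the simplex regularized by a KL term, the inner maximization has a closed log-sum-exp form, giving a temperature-scaled loss that approaches a CS-style max as tau goes to 0.

    ```python
    import math
    import torch

    def ldr_style_loss(logits, target, tau=1.0):
        # inner max over label weights p of <p, logits> - tau * KL(p || uniform)
        # has the closed form tau * logsumexp(logits / tau) - tau * log(K)
        K = logits.size(1)
        robust = tau * torch.logsumexp(logits / tau, dim=1) - tau * math.log(K)
        true_logit = logits.gather(1, target.unsqueeze(1)).squeeze(1)
        return (robust - true_logit).mean()  # tau -> 0 recovers a CS-style max
    ```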
    Deep Transfer-Learning for patient specific model re-calibration: Application to sEMG-Classification. (arXiv:2112.15019v1 [cs.LG])
    (0 min) Accurate decoding of surface electromyography (sEMG) is pivotal for muscle-to-machine interfaces (MMI) and their application to, e.g., rehabilitation therapy. sEMG signals have high inter-subject variability due to various factors, including skin thickness, body fat percentage, and electrode placement. Therefore, obtaining high generalization quality for a trained sEMG decoder is quite challenging. Usually, machine learning based sEMG decoders are either trained on subject-specific data or at least recalibrated for each user individually. Although deep learning algorithms have produced several state-of-the-art results for sEMG decoding, the limited availability of sEMG data makes deep learning models prone to overfitting. Recently, transfer learning for domain adaptation has improved generalization quality with reduced training time on various machine learning tasks. In this study, we investigate the effectiveness of transfer learning using weight initialization for the recalibration of two different pretrained deep learning models on a new subject's data, and compare their performance to subject-specific models. To the best of our knowledge, this is the first study that thoroughly investigates weight-initialization based transfer learning for sEMG classification and compares transfer learning to subject-specific modeling. We tested our models on three publicly available databases under various settings. On average over all settings, our transfer learning approach improves by 5 percentage points on the pretrained models without fine-tuning and by 12 percentage points on the subject-specific models, while training for on average 22% fewer epochs. Our results indicate that transfer learning enables faster training on fewer samples than user-specific models, and improves the performance of pretrained models as long as enough data is available.
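    A minimal sketch of the weight-initialization transfer described above (the architecture, checkpoint path, and hyperparameters are illustrative assumptions, not the study's code): load a decoder pretrained on many subjects, reinitialize the classification head, and recalibrate on the new subject's recordings with a small learning rate.

    ```python
    import torch
    import torch.nn as nn

    # hypothetical stand-in for a pretrained sEMG decoder
    class EMGNet(nn.Module):
        def __init__(self, n_channels=8, n_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv1d(n_channels, 32, kernel_size=5), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1), nn.Flatten())
            self.classifier = nn.Linear(32, n_classes)

        def forward(self, x):
            return self.classifier(self.features(x))

    model = EMGNet()
    # model.load_state_dict(torch.load("pretrained_multisubject.pt"))  # assumed checkpoint
    model.classifier = nn.Linear(32, 10)  # fresh head for the new subject

    # recalibrate all weights, starting from the pretrained initialization
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    ```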
    Neural Hierarchical Factorization Machines for User's Event Sequence Analysis. (arXiv:2112.15292v1 [cs.LG])
    (0 min) Many prediction tasks in real-world applications need to model multi-order feature interactions in a user's event sequence for better detection performance. However, existing popular solutions usually suffer from two key issues: 1) only focusing on feature interactions and failing to capture the sequence influence; 2) only focusing on sequence information but ignoring the internal feature relations of each event, thus failing to extract a better event representation. In this paper, we consider a two-level structure for capturing the hierarchical information over a user's event sequence: 1) learning event representations based on effective feature interactions; 2) modeling the sequence representation of the user's historical events. Experimental results on both industrial and public datasets clearly demonstrate that our model achieves significantly better performance compared with state-of-the-art baselines.
    Crowd-sensing Enhanced Parking Patrol using Trajectories of Sharing Bikes. (arXiv:2110.15557v2 [cs.LG] UPDATED)
    (0 min) Illegal vehicle parking is a common urban problem faced by major cities in the world, as it incurs traffic jams, which lead to air pollution and traffic accidents. The government highly relies on active human efforts to detect illegal parking events. However, such an approach is extremely ineffective to cover a large city since the police have to patrol over the entire city roads. The massive and high-quality sharing bike trajectories from Mobike offer us a unique opportunity to design a ubiquitous illegal parking detection approach, as most of the illegal parking events happen at curbsides and have significant impact on the bike users. The detection result can guide the patrol schedule, i.e., send the patrol policemen to the region with higher illegal parking risks, and further improve the patrol efficiency. Inspired by this idea, three main components are employed in the proposed framework: 1) trajectory pre-processing, which filters outlier GPS points, performs map-matching, and builds trajectory indexes; 2) illegal parking detection, which models the normal trajectories, extracts features from the evaluation trajectories, and utilizes a distribution test-based method to discover the illegal parking events; and 3) patrol scheduling, which leverages the detection result as reference context, and models the scheduling task as a multi-agent reinforcement learning problem to guide the patrol police. Finally, extensive experiments are presented to validate the effectiveness of illegal parking detection, as well as the improvement of patrol efficiency.
    Towards Robustness of Neural Networks. (arXiv:2112.15188v1 [cs.CV])
    (0 min) We introduce several new datasets, namely ImageNet-A/O and ImageNet-R, as well as a synthetic environment and testing suite we call CAOS. ImageNet-A/O allow researchers to focus on the blind spots remaining in ImageNet. ImageNet-R was specifically created with the intention of tracking robust representation, as the representations are no longer simply natural but include artistic and other renditions. The CAOS suite is built on the CARLA simulator, which allows for the inclusion of anomalous objects and can create reproducible synthetic environments and scenes for testing robustness. All of the datasets were created for testing robustness and measuring progress in robustness. The datasets have been used in various other works to measure their own progress in robustness, allowing for tangential progress that does not focus exclusively on natural accuracy. Given these datasets, we created several novel methods that aim to advance robustness research. We build off of simple baselines in the form of Maximum Logit and Typicality Score, and create a novel data augmentation method in the form of DeepAugment that improves on the aforementioned benchmarks. Maximum Logit considers the logit values instead of the values after the softmax operation, a small change that produces noticeable improvements. The Typicality Score compares the output distribution to a posterior distribution over classes. We show that this improves performance over the baseline in all but the segmentation task, and speculate that at the pixel level the semantic information of a pixel is less meaningful than class-level information. Finally, the new augmentation technique of DeepAugment utilizes neural networks to create augmentations on images that are radically different from the traditional geometric and camera-based transformations used previously.
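    For concreteness, a sketch of the Maximum Logit score mentioned above next to the usual maximum-softmax-probability baseline (illustrative only):

    ```python
    import torch

    def max_softmax_score(logits):
        # common baseline: confidence after the softmax
        return torch.softmax(logits, dim=1).max(dim=1).values

    def max_logit_score(logits):
        # skip the softmax and score with the raw maximum logit;
        # a small change that the abstract reports improves results
        return logits.max(dim=1).values
    ```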
    G-PATE: Scalable Differentially Private Data Generator via Private Aggregation of Teacher Discriminators. (arXiv:1906.09338v2 [cs.LG] UPDATED)
    (0 min) Recent advances in machine learning have largely benefited from the massive accessible training data. However, large-scale data sharing has raised great privacy concerns. In this work, we propose a novel privacy-preserving data Generative model based on the PATE framework (G-PATE), aiming to train a scalable differentially private data generator that preserves high generated data utility. Our approach leverages generative adversarial nets to generate data, combined with private aggregation among different discriminators to ensure strong privacy guarantees. Compared to existing approaches, G-PATE significantly improves the use of privacy budgets. In particular, we train a student data generator with an ensemble of teacher discriminators and propose a novel private gradient aggregation mechanism to ensure differential privacy on all information that flows from teacher discriminators to the student generator. In addition, with random projection and gradient discretization, the proposed gradient aggregation mechanism is able to effectively deal with high-dimensional gradient vectors. Theoretically, we prove that G-PATE ensures differential privacy for the data generator. Empirically, we demonstrate the superiority of G-PATE over prior work through extensive experiments. We show that G-PATE is the first work being able to generate high-dimensional image data with high data utility under limited privacy budgets ($\epsilon \le 1$). Our code is available at https://github.com/AI-secure/G-PATE.
    Speedup deep learning models on GPU by taking advantage of efficient unstructured pruning and bit-width reduction. (arXiv:2112.15445v1 [cs.LG])
    (0 min) This work is focused on the pruning of some convolutional neural networks (CNNs) and on improving their efficiency on graphics processing units (GPUs) by using a direct sparse algorithm. The Nvidia deep neural network (cuDNN) library provides some of the most effective implementations of deep learning (DL) algorithms for GPUs, which are the most commonly used accelerators for deep learning computations. One of the most common techniques for improving the efficiency of CNN models is weight pruning and quantization. There are two main types of pruning: structural and non-structural. The first enables much easier acceleration on many types of accelerators, but with this type it is difficult to achieve a sparsity level and accuracy as high as those obtained with the second type. Non-structural pruning with retraining can generate weight tensors with 90% or more sparsity in some deep CNN models. In this article, a pruning algorithm is presented which makes it possible to achieve high sparsity levels without an accuracy drop. In the next stage, linear and non-linear quantization is applied for further time and footprint reduction. This paper is an extended version of a previously published paper on effective pruning techniques, and presents real models pruned with high sparsity and reduced precision which can achieve better performance than the cuDNN library.
    Memory AMP. (arXiv:2012.10861v5 [cs.IT] UPDATED)
    (0 min) Approximate message passing (AMP) is a low-cost iterative parameter-estimation technique for certain high-dimensional linear systems with non-Gaussian distributions. However, AMP only applies to independent identically distributed (IID) transform matrices, but may become unreliable (e.g., perform poorly or even diverge) for other matrix ensembles, especially for ill-conditioned ones. Orthogonal/vector AMP (OAMP/VAMP) was proposed for general right-unitarily-invariant matrices to handle this difficulty. However, the Bayes-optimal OAMP/VAMP (BO-OAMP/VAMP) requires a high-complexity linear minimum mean square error (MMSE) estimator. This limits the application of OAMP/VAMP to large-scale systems. To solve the disadvantages of AMP and BO-OAMP/VAMP, this paper proposes a memory AMP (MAMP) framework under an orthogonality principle, which guarantees the asymptotic IID Gaussianity of estimation errors in MAMP. We present an orthogonalization procedure for the local memory estimators to realize the required orthogonality for MAMP. Furthermore, we propose a Bayes-optimal MAMP (BO-MAMP), in which a long-memory matched filter is proposed for interference suppression. The complexity of BO-MAMP is comparable to AMP. A state evolution is derived to asymptotically characterize the performance of BO-MAMP. Based on state evolution, the relaxation parameters and damping vector in BO-MAMP are optimized. For all right-unitarily-invariant matrices, the state evolution of the optimized BO-MAMP converges to the same fixed point as that of the high-complexity BO-OAMP/VAMP and is Bayes-optimal if its state evolution has a unique fixed point. Finally, simulations are provided to verify the validity and accuracy of the theoretical results.
    Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation. (arXiv:2106.11612v2 [cs.LG] UPDATED)
    (0 min) We study reinforcement learning (RL) with linear function approximation. Existing algorithms for this problem only have high-probability regret and/or Probably Approximately Correct (PAC) sample complexity guarantees, which cannot guarantee the convergence to the optimal policy. In this paper, in order to overcome the limitation of existing algorithms, we propose a new algorithm called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with high probability. The uniform-PAC guarantee is the strongest possible guarantee for reinforcement learning in the literature, which can directly imply both PAC and high probability regret bounds, making our algorithm superior to all existing algorithms with linear function approximation. At the core of our algorithm is a novel minimax value function estimator and a multi-level partition scheme to select the training samples from historical observations. Both of these techniques are new and of independent interest.
    Investigating and Modeling the Dynamics of Long Ties. (arXiv:2109.10523v2 [cs.SI] UPDATED)
    (0 min) Long ties, the social ties that bridge different communities, are widely believed to play crucial roles in spreading novel information in social networks. However, some existing network theories and prediction models indicate that long ties might dissolve quickly or eventually become redundant, thus putting into question the long-term value of long ties. Our empirical analysis of real-world dynamic networks shows that contrary to such reasoning, long ties are more likely to persist than other social ties, and that many of them constantly function as social bridges without being embedded in local networks. Using a novel cost-benefit analysis model combined with machine learning, we show that long ties are highly beneficial, which instinctively motivates people to expend extra effort to maintain them. This partly explains why long ties are more persistent than what has been suggested by many existing theories and models. Overall, our study suggests the need for social interventions that can promote the formation of long ties, such as mixing people with diverse backgrounds.
    Fairness-Oriented User Scheduling for Bursty Downlink Transmission Using Multi-Agent Reinforcement Learning. (arXiv:2012.15081v13 [cs.OS] UPDATED)
    (0 min) In this work, we develop practical user scheduling algorithms for downlink bursty traffic with an emphasis on user fairness. In contrast to conventional scheduling algorithms that either divide the transmission time slots equally among users or maximize ratios without physical meaning, we propose to use the 5%-tile user data rate (5TUDR) as the metric to evaluate user fairness. Since it is difficult to directly optimize 5TUDR, we first cast the problem into the stochastic game framework and subsequently propose a Multi-Agent Reinforcement Learning (MARL)-based algorithm to perform distributed optimization of the resource block group (RBG) allocation. Furthermore, each MARL agent is designed to take information measured by network counters from multiple network layers (e.g., Channel Quality Indicator, buffer size) as its input state and the RBG allocation as its action, with a proposed reward function designed to maximize 5TUDR. Extensive simulations show that the proposed MARL-based scheduler achieves fair scheduling while maintaining good average network throughput compared to conventional schedulers.
    Motif Graph Neural Network. (arXiv:2112.14900v1 [cs.LG])
    (0 min) Graphs can model complicated interactions between entities, which naturally emerge in many important applications. These applications can often be cast into standard graph learning tasks, in which a crucial step is to learn low-dimensional graph representations. Graph neural networks (GNNs) are currently the most popular model in graph embedding approaches. However, standard GNNs in the neighborhood aggregation paradigm suffer from limited discriminative power in distinguishing \emph{high-order} graph structures as opposed to \emph{low-order} structures. To capture high-order structures, researchers have resorted to motifs and developed motif-based GNNs. However, existing motif-based GNNs still often suffer from less discriminative power on high-order structures. To overcome the above limitations, we propose Motif Graph Neural Network (MGNN), a novel framework to better capture high-order structures, hinging on our proposed motif redundancy minimization operator and injective motif combination. First, MGNN produces a set of node representations w.r.t. each motif. The next phase is our proposed redundancy minimization among motifs which compares the motifs with each other and distills the features unique to each motif. Finally, MGNN performs the updating of node representations by combining multiple representations from different motifs. In particular, to enhance the discriminative power, MGNN utilizes an injective function to combine the representations w.r.t. different motifs. We further show that our proposed architecture increases the expressive power of GNNs with a theoretical analysis. We demonstrate that MGNN outperforms state-of-the-art methods on seven public benchmarks on both node classification and graph classification tasks.
    SPViT: Enabling Faster Vision Transformers via Soft Token Pruning. (arXiv:2112.13890v1 [cs.CV] CROSS LISTED)
    (0 min) Recently, Vision Transformer (ViT) has continuously established new milestones in the computer vision field, while its high computation and memory cost make its deployment in industrial production difficult. Pruning, a traditional model compression paradigm for hardware efficiency, has been widely applied in various DNN structures. Nevertheless, it remains unclear how to perform pruning specific to the ViT structure. Considering three key points: the structural characteristics, the internal data pattern of ViTs, and the related edge device deployment, we leverage the input token sparsity and propose a computation-aware soft pruning framework, which can be set up on vanilla Transformers of both flat and CNN-type structures, such as Pooling-based ViT (PiT). More concretely, we design a dynamic attention-based multi-head token selector, which is a lightweight module for adaptive instance-wise token selection. We further introduce a soft pruning technique, which integrates the less informative tokens generated by the selector module into a package token that participates in subsequent calculations rather than being completely discarded. Our framework is bound to the trade-off between accuracy and computation constraints of specific edge devices through our proposed computation-aware training strategy. Experimental results show that our framework significantly reduces the computation cost of ViTs while maintaining comparable performance on image classification. Moreover, our framework can guarantee that the identified model meets the resource specifications of mobile devices and FPGAs, and even achieve real-time execution of DeiT-T on mobile platforms. For example, our method reduces the latency of DeiT-T to 26 ms (26%~41% better than existing works) on a mobile device with 0.25%~4% higher top-1 accuracy on ImageNet. Our code will be released soon.
    Weakly Supervised Change Detection Using Guided Anisotropic Diffusion. (arXiv:2112.15367v1 [eess.IV])
    (0 min) Large scale datasets created from crowdsourced labels or openly available data have become crucial to provide training data for large scale learning algorithms. While these datasets are easier to acquire, the data are frequently noisy and unreliable, which is motivating research on weakly supervised learning techniques. In this paper we propose original ideas that help us to leverage such datasets in the context of change detection. First, we propose the guided anisotropic diffusion (GAD) algorithm, which improves semantic segmentation results using the input images as guides to perform edge preserving filtering. We then show its potential in two weakly-supervised learning strategies tailored for change detection. The first strategy is an iterative learning method that combines model optimisation and data cleansing using GAD to extract the useful information from a large scale change detection dataset generated from open vector data. The second one incorporates GAD within a novel spatial attention layer that increases the accuracy of weakly supervised networks trained to perform pixel-level predictions from image-level labels. Improvements with respect to state-of-the-art are demonstrated on 4 different public datasets.
    Recent Trends in Artificial Intelligence-inspired Electronic Thermal Management. (arXiv:2112.14837v1 [cs.LG])
    (0 min) The rise of computation-based methods in thermal management has gained immense attention in recent years due to the ability of deep learning to solve complex 'physics' problems, which are otherwise difficult to approach using conventional techniques. Thermal management is required in electronic systems to keep them from overheating and burning, enhancing their efficiency and lifespan. For a long time, numerical techniques have been employed to aid in the thermal management of electronics. However, they come with some limitations. To increase the effectiveness of traditional numerical approaches and address their drawbacks, researchers have looked at using artificial intelligence at various stages of the thermal management process. The present study discusses in detail the current uses of deep learning in the domain of 'electronic' thermal management.
    What is Event Knowledge Graph: A Survey. (arXiv:2112.15280v1 [cs.LG])
    (0 min) Besides entity-centric knowledge, usually organized as a Knowledge Graph (KG), events are also an essential kind of knowledge in the world, which has triggered the emergence of event-centric knowledge representation forms like the Event KG (EKG). It plays an increasingly important role in many machine learning and artificial intelligence applications, such as intelligent search, question-answering, recommendation, and text generation. This paper provides a comprehensive survey of EKG from the history, ontology, instance, and application views. Specifically, to characterize EKG thoroughly, we focus on its history, definitions, schema induction, acquisition, related representative graphs/systems, and applications. The development processes and trends are studied therein. We further summarize prospective directions to facilitate future research on EKG.
    K-Core Decomposition on Super Large Graphs with Limited Resources. (arXiv:2112.14840v1 [cs.DC])
    (0 min) K-core decomposition is a commonly used metric to analyze graph structure or study the relative importance of nodes in complex graphs. Recent years have seen rapid growth in the scale of graphs, especially in industrial settings. For example, our industrial partner runs popular social applications with billions of users and is able to gather a rich set of user data. As a result, applying K-core decomposition to large graphs has attracted more and more attention from academia and industry. A simple but effective method to deal with large graphs is to process them in distributed settings, and some distributed K-core decomposition algorithms have been proposed. Despite their effectiveness, we experimentally and theoretically observe that these algorithms consume too many resources and become unstable on super-large-scale graphs, especially when the given resources are limited. In this paper, we deal with those super-large-scale graphs and propose a divide-and-conquer strategy on top of the distributed K-core decomposition algorithm. We evaluate our approach on three large graphs. The experimental results show that the consumption of resources can be significantly reduced, and the calculation on large-scale graphs becomes more stable than with existing methods. For example, with our divide-and-conquer technique, the distributed K-core decomposition algorithm can scale to a large graph with 136 billion edges without losing correctness.
    Persformer: A Transformer Architecture for Topological Machine Learning. (arXiv:2112.15210v1 [cs.LG])
    (0 min) One of the main challenges of Topological Data Analysis (TDA) is to extract features from persistence diagrams that are directly usable by machine learning algorithms. Indeed, persistence diagrams are intrinsically (multi-)sets of points in $\mathbb{R}^2$ and cannot be seen in a straightforward manner as vectors. In this article, we introduce Persformer, the first Transformer neural network architecture that accepts persistence diagrams as input. The Persformer architecture significantly outperforms previous topological neural network architectures on classical synthetic benchmark datasets. Moreover, it satisfies a universal approximation theorem. This allows us to introduce the first interpretability method for topological machine learning, which we explore in two examples.
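    A toy sketch of the idea (not the Persformer architecture itself): embed each (birth, death) pair as a token, prepend a class token, and use a standard Transformer encoder without positional encodings so the model treats the diagram as a set.

    ```python
    import torch
    import torch.nn as nn

    class DiagramTransformer(nn.Module):
        def __init__(self, d_model=64, n_heads=4, n_layers=2, n_classes=2):
            super().__init__()
            self.embed = nn.Linear(2, d_model)  # (birth, death) -> token
            self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.head = nn.Linear(d_model, n_classes)

        def forward(self, diagrams):  # (batch, n_points, 2) point sets
            tokens = self.embed(diagrams)
            cls = self.cls.expand(tokens.size(0), -1, -1)
            out = self.encoder(torch.cat([cls, tokens], dim=1))
            return self.head(out[:, 0])  # pool via the class token
    ```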
    Audio-to-symbolic Arrangement via Cross-modal Music Representation Learning. (arXiv:2112.15110v1 [cs.SD])
    (0 min) Could we automatically derive the score of a piano accompaniment based on the audio of a pop song? This is the audio-to-symbolic arrangement problem we tackle in this paper. A good arrangement model should not only consider the audio content but also have prior knowledge of piano composition (so that the generation "sounds like" the audio while maintaining musicality). To this end, we contribute a cross-modal representation-learning model, which 1) extracts chord and melodic information from the audio, and 2) learns texture representation from both audio and a corrupted ground truth arrangement. We further introduce a tailored training strategy that gradually shifts the source of texture information from corrupted score to audio. In the end, the score-based texture posterior is reduced to a standard normal distribution, and only audio is needed for inference. Experiments show that our model captures major audio information and outperforms baselines in generation quality.
    Mythological Medical Machine Learning: Boosting the Performance of a Deep Learning Medical Data Classifier Using Realistic Physiological Models. (arXiv:2112.15442v1 [cs.LG])
    (0 min) Objective: To determine if a realistic, but computationally efficient, model of the electrocardiogram can be used to pre-train a deep neural network (DNN) with a wide range of morphologies and abnormalities specific to a given condition - T-wave Alternans (TWA) as a result of Post-Traumatic Stress Disorder, or PTSD - and significantly boost performance on a small database of rare individuals. Approach: Using a previously validated artificial ECG model, we generated 180,000 artificial ECGs with or without significant TWA, with varying heart rate, breathing rate, TWA amplitude, and ECG morphology. A DNN, trained on over 70,000 patients to classify 25 different rhythms, was modified by replacing the output layer with a binary class (TWA or no-TWA, or equivalently, PTSD or no-PTSD), and transfer learning was performed on the artificial ECG. In a final transfer learning step, the DNN was trained and cross-validated on ECG from 12 PTSD subjects and 24 controls for all combinations of the three databases. Main results: The best performing approach (AUROC = 0.77, Accuracy = 0.72, F1-score = 0.64) was found by performing both transfer learning steps, using the pre-trained arrhythmia DNN, the artificial data, and the real PTSD-related ECG data. Removing the artificial data from training led to the largest drop in performance. Removing the arrhythmia data from training provided a modest, but significant, drop in performance. The final model showed no significant drop in performance on the artificial data, indicating no overfitting. Significance: In healthcare, it is common to only have a small collection of high-quality data and labels, or a larger database with much lower quality (and less relevant) labels. The paradigm presented here, involving model-based performance boosting, provides a solution through transfer learning on a large realistic artificial database and a partially relevant real database.
    GANISP: a GAN-assisted Importance SPlitting Probability Estimator. (arXiv:2112.15444v1 [cs.LG])
    (0 min) Designing manufacturing processes with high yield and strong reliability relies on effective methods for rare event estimation. Genealogical importance splitting reduces the variance of rare event probability estimators by iteratively selecting and replicating realizations that are headed towards a rare event. The replication step is difficult when applied to deterministic systems where the initial conditions of the offspring realizations need to be modified. Typically, a random perturbation is applied to the offspring to differentiate their trajectory from the parent realization. However, this random perturbation strategy may be effective for some systems while failing for others, preventing variance reduction in the probability estimate. This work seeks to address this limitation using a generative model such as a Generative Adversarial Network (GAN) to generate perturbations that are consistent with the attractor of the dynamical system. The proposed GAN-assisted Importance SPlitting method (GANISP) improves the variance reduction for the system targeted. An implementation of the method is available in a companion repository (https://github.com/NREL/GANISP).
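    A hedged sketch of the select-and-replicate loop described above; `simulate`, `score`, and the GAN-based `perturb` are assumed callables standing in for the system of interest, and the estimator details are simplified.

    ```python
    import numpy as np

    def importance_splitting(states, simulate, score, perturb,
                             n_levels=10, keep_frac=0.5):
        log_p = 0.0
        for _ in range(n_levels):
            scores = np.array([score(simulate(s)) for s in states])
            thresh = np.quantile(scores, 1.0 - keep_frac)  # adaptive level
            survivors = [s for s, sc in zip(states, scores) if sc >= thresh]
            log_p += np.log(len(survivors) / len(states))
            idx = np.random.choice(len(survivors), size=len(states))
            # replication step: the generative model perturbs offspring so
            # they stay consistent with the attractor, instead of random noise
            states = [perturb(survivors[i]) for i in idx]
        return np.exp(log_p)  # crude estimate of the rare-event probability
    ```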
    Data-Free Knowledge Transfer: A Survey. (arXiv:2112.15278v1 [cs.LG])
    (0 min) In the last decade, many deep learning models have been well trained and have achieved great success in various fields of machine intelligence, especially computer vision and natural language processing. To better leverage the potential of these well-trained models in intra-domain or cross-domain transfer learning situations, knowledge distillation (KD) and domain adaptation (DA) have been proposed and have become research highlights. They both aim to transfer useful information from a well-trained model with original training data. However, the original data is not always available in many cases due to privacy, copyright, or confidentiality. Recently, the data-free knowledge transfer paradigm has attracted appealing attention as it deals with distilling valuable knowledge from well-trained models without requiring access to the training data. In particular, it mainly consists of data-free knowledge distillation (DFKD) and source-data-free domain adaptation (SFDA). On the one hand, DFKD aims to transfer the intra-domain knowledge of original data from a cumbersome teacher network to a compact student network for model compression and efficient inference. On the other hand, the goal of SFDA is to reuse the cross-domain knowledge stored in a well-trained source model and adapt it to a target domain. In this paper, we provide a comprehensive survey on data-free knowledge transfer from the perspectives of knowledge distillation and unsupervised domain adaptation, to help readers better understand the current research status and ideas. Applications and challenges of the two areas are briefly reviewed. Furthermore, we provide some insights into future research directions.
    A Graph Attention Learning Approach to Antenna Tilt Optimization. (arXiv:2112.14843v1 [cs.LG])
    (0 min) 6G will move mobile networks towards increasing levels of complexity. To deal with this complexity, optimization of network parameters is key to ensure high performance and timely adaptivity to dynamic network environments. The optimization of the antenna tilt provides a practical and cost-efficient method to improve coverage and capacity in the network. Previous methods based on Reinforcement Learning (RL) have shown great promise for tilt optimization by learning adaptive policies that outperform traditional tilt optimization methods. However, most existing RL methods are based on single-cell feature representations, which fail to fully characterize the agent state, resulting in suboptimal performance. Also, most such methods lack scalability, due to state-action explosion, and generalization ability. In this paper, we propose a Graph Attention Q-learning (GAQ) algorithm for tilt optimization. GAQ relies on a graph attention mechanism to select relevant neighbor information, improve the agent state representation, and update the tilt control policy based on a history of observations using a Deep Q-Network (DQN). We show that GAQ efficiently captures important network information and outperforms standard DQN with local information by a large margin. In addition, we demonstrate its ability to generalize to network deployments of different sizes and densities.
    A theory of independent mechanisms for extrapolation in generative models. (arXiv:2004.00184v2 [cs.LG] UPDATED)
    (0 min) Generative models can be trained to emulate complex empirical data, but are they useful to make predictions in the context of previously unobserved environments? An intuitive idea to promote such extrapolation capabilities is to have the architecture of such model reflect a causal graph of the true data generating process, such that one can intervene on each node independently of the others. However, the nodes of this graph are usually unobserved, leading to overparameterization and lack of identifiability of the causal structure. We develop a theoretical framework to address this challenging situation by defining a weaker form of identifiability, based on the principle of independence of mechanisms. We demonstrate on toy examples that classical stochastic gradient descent can hinder the model's extrapolation capabilities, suggesting independence of mechanisms should be enforced explicitly during training. Experiments on deep generative models trained on real world data support these insights and illustrate how the extrapolation capabilities of such models can be leveraged.
    Bayesian Optimization of Function Networks. (arXiv:2112.15311v1 [cs.LG])
    (2 min) We consider Bayesian optimization of the output of a network of functions, where each function takes as input the output of its parent nodes, and where the network takes significant time to evaluate. Such problems arise, for example, in reinforcement learning, engineering design, and manufacturing. While the standard Bayesian optimization approach observes only the final output, our approach delivers greater query efficiency by leveraging information that the former ignores: intermediate output within the network. This is achieved by modeling the nodes of the network using Gaussian processes and choosing the points to evaluate using, as our acquisition function, the expected improvement computed with respect to the implied posterior on the objective. Although the non-Gaussian nature of this posterior prevents computing our acquisition function in closed form, we show that it can be efficiently maximized via sample average approximation. In addition, we prove that our method is asymptotically consistent, meaning that it finds a globally optimal solution as the number of evaluations grows to infinity, thus generalizing previously known convergence results for the expected improvement. Notably, this holds even though our method might not evaluate the domain densely, instead leveraging problem structure to leave regions unexplored. Finally, we show that our approach dramatically outperforms standard Bayesian optimization methods in several synthetic and real-world problems.
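    A minimal sketch of the sample average approximation mentioned above (`posterior_sampler` is an assumed function returning Monte Carlo draws of the network's final output at a candidate point; this is not the authors' implementation).

    ```python
    import numpy as np

    def ei_saa(x, posterior_sampler, best_so_far, n_samples=256):
        # expected improvement under a non-Gaussian implied posterior,
        # estimated by averaging over posterior samples at x
        samples = posterior_sampler(x, n_samples)
        return np.maximum(samples - best_so_far, 0.0).mean()

    # the next evaluation maximizes ei_saa over the domain; fixing the base
    # samples across candidate x keeps the SAA objective smooth to optimize
    ```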
    Settling the Bias and Variance of Meta-Gradient Estimation for Meta-Reinforcement Learning. (arXiv:2112.15400v1 [cs.LG])
    (0 min) In recent years, gradient-based Meta-RL (GMRL) methods have achieved remarkable successes in either discovering effective online hyperparameters for a single task (Xu et al., 2018) or learning good initialisations for multi-task transfer learning (Finn et al., 2017). Despite the empirical successes, it is often neglected that computing meta-gradients via vanilla backpropagation is ill-defined. In this paper, we argue that the stochastic meta-gradient estimation adopted by many existing GMRL methods is in fact biased; the bias comes from two sources: 1) the compositional bias that is inborn in the structure of compositional optimisation problems and 2) the bias of multi-step Hessian estimation caused by direct automatic differentiation. To better understand the meta-gradient biases, we perform the first study of its kind to quantify each of them. We start by providing a unifying derivation for existing GMRL algorithms, and then theoretically analyse both the bias and the variance of existing gradient estimation methods. On understanding the underlying principles of the bias, we propose two mitigation solutions based on off-policy correction and multi-step Hessian estimation techniques. Comprehensive ablation studies have been conducted and the results reveal: (1) the existence of these two biases and how they influence the meta-gradient estimation when combined with different estimators, sample sizes, steps, and learning rates; (2) the effectiveness of these mitigation approaches for meta-gradient estimation and thereby the final return on two practical Meta-RL algorithms: LOLA-DiCE and Meta-gradient Reinforcement Learning.
    On the Role of Neural Collapse in Transfer Learning. (arXiv:2112.15121v1 [cs.LG])
    (2 min) We study the ability of foundation models to learn representations for classification that are transferable to new, unseen classes. Recent results in the literature show that representations learned by a single classifier over many classes are competitive on few-shot learning problems with representations learned by special-purpose algorithms designed for such problems. In this paper we provide an explanation for this behavior based on the recently observed phenomenon that the features learned by overparameterized classification networks show an interesting clustering property, called neural collapse. We demonstrate both theoretically and empirically that neural collapse generalizes to new samples from the training classes, and -- more importantly -- to new classes as well, allowing foundation models to provide feature maps that work well in transfer learning and, specifically, in the few-shot setting.
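    One common way to quantify the clustering property discussed above is the ratio of within-class to between-class scatter of penultimate-layer features (a small ratio indicates variability collapse); the sketch below is illustrative, not the paper's exact metric.

    ```python
    import torch

    def collapse_ratio(features, labels):
        # features: (N, D) penultimate-layer activations; labels: (N,)
        global_mean = features.mean(0)
        within, between = 0.0, 0.0
        for c in labels.unique():
            fc = features[labels == c]
            mu_c = fc.mean(0)
            within += ((fc - mu_c) ** 2).sum()
            between += len(fc) * ((mu_c - global_mean) ** 2).sum()
        return (within / between).item()  # -> 0 under strong neural collapse
    ```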
    Deep Learning Models for Knowledge Tracing: Review and Empirical Evaluation. (arXiv:2112.15072v1 [cs.LG])
    (2 min) In this work, we review and evaluate a body of deep learning knowledge tracing (DLKT) models with openly available and widely-used data sets, and with a novel data set of students learning to program. The evaluated DLKT models have been reimplemented for assessing reproducibility and replicability of previously reported results. We test different input and output layer variations found in the compared models that are independent of the main architectures of the models, and different maximum attempt count options that have been implicitly and explicitly used in some studies. Several metrics are used to reflect on the quality of the evaluated knowledge tracing models. The evaluated knowledge tracing models include Vanilla-DKT, two Long Short-Term Memory Deep Knowledge Tracing (LSTM-DKT) variants, two Dynamic Key-Value Memory Network (DKVMN) variants, and Self-Attentive Knowledge Tracing (SAKT). We evaluate logistic regression, Bayesian Knowledge Tracing (BKT) and simple non-learning models as baselines. Our results suggest that the DLKT models in general outperform non-DLKT models, and the relative differences between the DLKT models are subtle and often vary between datasets. Our results also show that naive models such as mean prediction can yield better performance than more sophisticated knowledge tracing models, especially in terms of accuracy. Further, our metric and hyperparameter analysis shows that the metric used to select the best model hyperparameters has a noticeable effect on the performance of the models, and that metric choice can affect model ranking. We also study the impact of input and output layer variations, filtering out long attempt sequences, and non-model properties such as randomness and hardware. Finally, we discuss model performance replicability and related issues. Our model implementations, evaluation code, and data are published as a part of this work.
    Reversible Upper Confidence Bound Algorithm to Generate Diverse Optimized Candidates. (arXiv:2112.14893v1 [cs.LG])
    (2 min) Most algorithms for the multi-armed bandit problem in reinforcement learning aim to maximize the expected reward and are thus useful in searching for the single optimal candidate with the highest reward (function value) in diverse applications (e.g., AlphaGo). However, in some typical application scenarios such as drug discovery, the aim is to search for a diverse set of candidates with high reward. Here we propose a reversible upper confidence bound (rUCB) algorithm for such a purpose, and demonstrate its application in virtual screening on intrinsically disordered proteins (IDPs). It is shown that rUCB greatly reduces the query times while achieving both high accuracy and low performance loss. The rUCB may have potential applications in multipoint optimization and other reinforcement-learning cases.
    Learned Coarse Models for Efficient Turbulence Simulation. (arXiv:2112.15275v1 [physics.flu-dyn])
    (0 min) Turbulence simulation with classical numerical solvers requires very high-resolution grids to accurately resolve dynamics. Here we train learned simulators at low spatial and temporal resolutions to capture turbulent dynamics generated at high resolution. We show that our proposed model can simulate turbulent dynamics more accurately than classical numerical solvers at the same low resolutions across various scientifically relevant metrics. Our model is trained end-to-end from data and is capable of learning a range of challenging chaotic and turbulent dynamics at low resolution, including trajectories generated by the state-of-the-art Athena++ engine. We show that our simpler, general-purpose architecture outperforms various more specialized, turbulence-specific architectures from the learned turbulence simulation literature. In general, we see that learned simulators yield unstable trajectories; however, we show that tuning training noise and temporal downsampling solves this problem. We also find that while generalization beyond the training distribution is a challenge for learned models, training noise, convolutional architectures, and added loss constraints can help. Broadly, we conclude that our learned simulator outperforms traditional solvers run on coarser grids, and emphasize that simple design choices can offer stability and robust generalization.
    Deep Graph Clustering via Dual Correlation Reduction. (arXiv:2112.14772v1 [cs.LG])
    (0 min) Deep graph clustering, which aims to reveal the underlying graph structure and divide the nodes into different groups, has attracted intensive attention in recent years. However, we observe that, in the process of node encoding, existing methods suffer from representation collapse, which tends to map all data into the same representation. Consequently, the discriminative capability of the node representation is limited, leading to unsatisfactory clustering performance. To address this issue, we propose a novel self-supervised deep graph clustering method termed Dual Correlation Reduction Network (DCRN) that reduces information correlation in a dual manner. Specifically, in our method, we first design a siamese network to encode samples. Then, by forcing the cross-view sample correlation matrix and cross-view feature correlation matrix to approximate two identity matrices, respectively, we reduce the information correlation at the dual level, thus improving the discriminative capability of the resulting features. Moreover, in order to alleviate representation collapse caused by over-smoothing in GCN, we introduce a propagation regularization term to enable the network to gain long-distance information with a shallow network structure. Extensive experimental results on six benchmark datasets demonstrate the effectiveness of the proposed DCRN against existing state-of-the-art methods.
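    In the spirit of the dual correlation reduction described above, a hedged sketch of a feature-level decorrelation term (a generic cross-view loss, not the authors' exact objective): the cross-view feature correlation matrix is pushed toward the identity.

    ```python
    import torch
    import torch.nn.functional as F

    def correlation_reduction_loss(z1, z2):
        # z1, z2: (N, D) embeddings of the same nodes from two views
        z1 = F.normalize(z1, dim=0)
        z2 = F.normalize(z2, dim=0)
        c = z1.T @ z2  # (D, D) cross-view feature correlation matrix
        eye = torch.eye(c.size(0), device=c.device)
        return ((c - eye) ** 2).mean()  # diagonal -> 1, off-diagonal -> 0
    ```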
    Active Learning-Based Optimization of Scientific Experimental Design. (arXiv:2112.14811v1 [cs.LG])
    (0 min) Active learning (AL) is a machine learning paradigm that can achieve greater accuracy with fewer labeled training instances, owing to its ability to ask oracles to label the most valuable unlabeled data, chosen iteratively and heuristically by query strategies. Scientific experiments nowadays, though becoming increasingly automated, still suffer from human involvement in the design process and from exhaustive search of the experimental space. This article performs a retrospective study on a drug response dataset using the proposed AL scheme comprised of the matrix factorization method of alternating least squares (ALS) and deep neural networks (DNNs). This article also proposes an AL query strategy based on expected loss minimization (ELM). As a result, the retrospective study demonstrates that scientific experimental design, instead of being manually set, can be optimized by AL, and the proposed ELM sampling query strategy shows better experimental performance than others such as random sampling and uncertainty sampling.
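    A generic pool-based loop of the kind the study retrofits onto the drug response dataset (a sketch under assumptions: `model` follows a scikit-learn-like fit/predict_proba interface, and the placeholder uncertainty acquisition below would be swapped for the proposed expected-loss-minimization score).

    ```python
    import numpy as np

    def active_learning_loop(X_pool, y_oracle, model, n_rounds=10, batch=16):
        labeled = list(np.random.choice(len(X_pool), batch, replace=False))
        for _ in range(n_rounds):
            model.fit(X_pool[labeled], y_oracle[labeled])
            proba = model.predict_proba(X_pool)
            acq = 1.0 - proba.max(axis=1)  # placeholder: uncertainty sampling
            acq[labeled] = -np.inf         # never re-query labeled points
            labeled += list(np.argsort(-acq)[:batch])
        return model, labeled
    ```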
    Digital Rock Typing DRT Algorithm Formulation with Optimal Supervised Semantic Segmentation. (arXiv:2112.15068v1 [cs.LG])
    (0 min) Each grid block in a 3D geological model requires a rock type that represents all physical and chemical properties of that block. The properties that classify rock types are lithology, permeability, and capillary pressure. Scientists and engineers have determined these properties using conventional laboratory measurements, which embed destructive methods into the sample or alter some of its properties (i.e., wettability, permeability, and porosity) because the measurement process includes sample crushing, fluid flow, or fluid saturation. Lately, Digital Rock Physics (DRP) has emerged to quantify these properties from micro-Computerized Tomography (uCT) and Magnetic Resonance Imaging (MRI) images. However, the literature has not attempted rock typing in a wholly digital context. We propose performing Digital Rock Typing (DRT) by: (1) integrating the latest DRP advances in a novel process that honors digital rock properties determination; (2) digitalizing the latest rock typing approaches in carbonate; and (3) introducing a novel carbonate rock typing process that utilizes computer vision capabilities to provide more insight about the heterogeneous carbonate rock texture.
    Retrieving Black-box Optimal Images from External Databases. (arXiv:2112.14921v1 [cs.IR])
    (2 min) Suppose we have a black-box function (e.g., deep neural network) that takes an image as input and outputs a value that indicates preference. How can we retrieve optimal images with respect to this function from an external database on the Internet? Standard retrieval problems in the literature (e.g., item recommendations) assume that an algorithm has full access to the set of items. In other words, such algorithms are designed for service providers. In this paper, we consider the retrieval problem under different assumptions. Specifically, we consider how users with limited access to an image database can retrieve images using their own black-box functions. This formulation enables a flexible and finer-grained image search defined by each user. We assume the user can access the database through a search query with tight API limits. Therefore, a user needs to efficiently retrieve optimal images in terms of the number of queries. We propose an efficient retrieval algorithm Tiara for this problem. In the experiments, we confirm that our proposed method performs better than several baselines under various settings.
    Local Quadratic Convergence of Stochastic Gradient Descent with Adaptive Step Size. (arXiv:2112.14872v1 [math.OC])
    (0 min) Establishing a fast rate of convergence for optimization methods is crucial to their applicability in practice. With the increasing popularity of deep learning over the past decade, stochastic gradient descent and its adaptive variants (e.g. Adagrad, Adam, etc.) have become prominent methods of choice for machine learning practitioners. While a large number of works have demonstrated that these first order optimization methods can achieve sub-linear or linear convergence, we establish local quadratic convergence for stochastic gradient descent with adaptive step size for problems such as matrix inversion.
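    For reference, a plain Adagrad-style update, one of the adaptive variants named above (illustrative; the paper's specific step-size rule and the matrix-inversion objective are not reproduced here).

    ```python
    import numpy as np

    def adagrad(grad_fn, x0, lr=0.1, eps=1e-8, n_steps=1000):
        x, g2 = x0.copy(), np.zeros_like(x0)
        for _ in range(n_steps):
            g = grad_fn(x)          # stochastic gradient at x
            g2 += g * g             # accumulate squared gradients
            x -= lr * g / (np.sqrt(g2) + eps)  # per-coordinate step size
        return x
    ```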
    RheFrameDetect: A Text Classification System for Automatic Detection of Rhetorical Frames in AI from Open Sources. (arXiv:2112.14933v1 [cs.CL])
    (0 min) Rhetorical Frames in AI can be thought of as expressions that describe AI development as a competition between two or more actors, such as governments or companies. Examples of such Frames include robotic arms race, AI rivalry, technological supremacy, cyberwarfare dominance, and the 5G race. Detection of Rhetorical Frames from open sources can help us track the attitudes of governments or companies towards AI, specifically whether attitudes are becoming more cooperative or competitive over time. Given the rapidly increasing volumes of open sources (online news media, Twitter, blogs), it is difficult for subject matter experts to identify Rhetorical Frames in (near) real-time. Moreover, these sources are in general unstructured (noisy), and therefore detecting Frames from these sources requires state-of-the-art text classification techniques. In this paper, we develop RheFrameDetect, a text classification system for (near) real-time capture of Rhetorical Frames from open sources. Given an input document, RheFrameDetect employs text classification techniques at multiple levels (document level and paragraph level) to identify all occurrences of Frames used in the discussion of AI. We performed an extensive evaluation of the text classification techniques used in RheFrameDetect against human annotated Frames from multiple news sources. To further demonstrate the effectiveness of RheFrameDetect, we show multiple case studies depicting the Frames identified by RheFrameDetect compared against human annotated Frames.
    Training Recurrent Neural Networks by Sequential Least Squares and the Alternating Direction Method of Multipliers. (arXiv:2112.15348v1 [cs.LG])
    (0 min) We propose the use of sequential least squares for determining the optimal parameters and hidden states of recurrent neural network models of nonlinear dynamical systems, trained on an input/output dataset under rather arbitrary convex and twice-differentiable loss functions and regularization terms. In addition, to handle non-smooth regularization terms such as L1, L0, and group-Lasso regularizers, as well as to impose possibly non-convex constraints such as integer and mixed-integer constraints, we combine sequential least squares with the alternating direction method of multipliers (ADMM). The performance of the resulting algorithm, which we call NAILS (Nonconvex ADMM Iterations and Least Squares), is tested on a nonlinear system identification benchmark.
    Development of a face mask detection pipeline for mask-wearing monitoring in the era of the COVID-19 pandemic: A modular approach. (arXiv:2112.15031v1 [cs.CV])
    (2 min) During the SARS-CoV-2 pandemic, mask-wearing became an effective tool to prevent spreading and contracting the virus. The ability to monitor the mask-wearing rate in the population would be useful for determining public health strategies against the virus. However, artificial intelligence technologies for detecting face masks have not been deployed at a large scale in real life to measure the mask-wearing rate in public. In this paper, we present a two-step face mask detection approach consisting of two separate modules: 1) face detection and alignment and 2) face mask classification. This approach allowed us to experiment with different combinations of face detection and face mask classification modules. More specifically, we experimented with PyramidKey and RetinaFace as face detectors while maintaining a lightweight backbone for the face mask classification module. Moreover, we also provide a relabeled annotation of the test set of the AIZOO dataset, where we rectified the incorrect labels for some face images. The evaluation results on the AIZOO and Moxa 3K datasets showed that the proposed face mask detection pipeline surpassed the state-of-the-art methods. The proposed pipeline also yielded a higher mAP on the relabeled test set of the AIZOO dataset than on the original test set. Since we trained the proposed model using in-the-wild face images, we can successfully deploy our model to monitor the mask-wearing rate using public CCTV images.
    Domain Adaptation with Category Attention Network for Deep Sentiment Analysis. (arXiv:2112.15290v1 [cs.CL])
    (0 min) Domain adaptation tasks such as cross-domain sentiment classification aim to utilize existing labeled data in the source domain and unlabeled or scarcely labeled data in the target domain to improve performance in the target domain by reducing the shift between the data distributions. Existing cross-domain sentiment classification methods need to distinguish pivots, i.e., domain-shared sentiment words, from non-pivots, i.e., domain-specific sentiment words, to achieve good adaptation performance. In this paper, we first design a Category Attention Network (CAN), and then propose a model named CAN-CNN that integrates CAN with a Convolutional Neural Network (CNN). On the one hand, the model regards pivots and non-pivots as unified category attribute words and can automatically capture them to improve domain adaptation performance; on the other hand, the model gains interpretability by learning the transferred category attribute words. Specifically, the optimization objective of our model has three components: 1) the supervised classification loss; 2) the distribution loss on category feature weights; 3) the domain invariance loss. Finally, the proposed model is evaluated on three public sentiment analysis datasets and the results demonstrate that CAN-CNN outperforms various baseline methods.
    Graph Neural Networks for Communication Networks: Context, Use Cases and Opportunities. (arXiv:2112.14792v1 [cs.NI])
    (2 min) Graph neural networks (GNN) have shown outstanding performance in many fields where data is fundamentally represented as graphs (e.g., chemistry, biology, recommendation systems). In this vein, communication networks comprise many fundamental components that are naturally represented in a graph-structured manner (e.g., topology, configurations, traffic flows). This position article presents GNNs as a fundamental tool for modeling, control and management of communication networks. GNNs represent a new generation of data-driven models that can accurately learn and reproduce the complex behaviors behind real networks. As a result, such models can be applied to a wide variety of networking use cases, such as planning, online optimization, or troubleshooting. The main advantage of GNNs over traditional neural networks lies in their unprecedented generalization capabilities when applied to other networks and configurations unseen during training, which is a critical feature for achieving practical data-driven solutions for networking. This article comprises a brief tutorial on GNNs and their possible applications to communication networks. To showcase the potential of this technology, we present two use cases with state-of-the-art GNN models respectively applied to wired and wireless networks. Lastly, we delve into the key open challenges and opportunities yet to be explored in this novel research area.
    Transfer learning for cancer diagnosis in histopathological images. (arXiv:2112.15523v1 [eess.IV])
    (0 min) Transfer learning allows us to exploit knowledge gained from one task to assist in solving another, related task. In modern computer vision research, the question is which architecture performs best for a given dataset. In this paper, we compare the performance of 14 pre-trained ImageNet models on the histopathologic cancer detection dataset, where each model has been configured as a naive model, a feature extractor model, or a fine-tuned model. Densenet161 has been shown to have high precision, whilst Resnet101 has high recall. A high-precision model is suitable when the follow-up examination cost is high, whilst a low-precision but high-recall/sensitivity model can be used when the cost of follow-up examination is low. Results also show that transfer learning helps models converge faster.
    Explainable Signature-based Machine Learning Approach for Identification of Faults in Grid-Connected Photovoltaic Systems. (arXiv:2112.14842v1 [cs.LG])
    (0 min) The transformation of conventional power networks into smart grids, with heavy penetration of renewable energy resources, particularly grid-connected Photovoltaic (PV) systems, has increased the need for efficient fault identification systems. A malfunction in any single component of a grid-connected PV system may lead to grid instability and other serious consequences, making a reliable fault identification system an essential requirement for ensuring operational integrity. Therefore, this paper presents a novel fault identification approach based on statistical signatures of PV operational states. These signatures are unique because each fault has a different nature and a distinctive impact on the electrical system. Thus, a Random Forest Classifier trained on these extracted signatures showed 100% accuracy in identifying all types of faults. Furthermore, a performance comparison of the proposed framework with other Machine Learning classifiers demonstrates its credibility. Moreover, to elevate user trust in the predicted outcomes, SHAP (Shapley Additive Explanation) was utilized during the training phase to extract a complete model response (global explanation). This extracted global explanation can help in assessing the credibility of predicted outcomes by decoding each prediction in terms of feature contributions. Hence, the proposed explainable signature-based fault identification technique is highly credible and fulfills the requirements of smart grids.
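    The signature-plus-classifier pipeline is straightforward to prototype. Below is a minimal sketch on placeholder data; the statistical features are an illustrative choice, not the paper's exact signatures, and shap.TreeExplainer stands in for the global explanation step the abstract describes.

    ```python
    import numpy as np
    import shap
    from sklearn.ensemble import RandomForestClassifier

    def signature(window):
        # A simple statistical signature of an operational state (illustrative choice).
        return [window.mean(), window.std(), window.min(), window.max(),
                np.percentile(window, 25), np.percentile(window, 75)]

    rng = np.random.default_rng(1)
    windows = rng.normal(size=(500, 200))     # placeholder raw measurement windows
    labels = rng.integers(0, 4, size=500)     # placeholder fault classes

    X = np.array([signature(w) for w in windows])
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)

    # Global explanation: SHAP decodes each prediction into feature contributions.
    explainer = shap.TreeExplainer(clf)
    shap_values = explainer.shap_values(X)    # per-class, per-sample contributions
    ```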
    Robust Entropy-regularized Markov Decision Processes. (arXiv:2112.15364v1 [cs.LG])
    (0 min) Stochastic and soft optimal policies resulting from entropy-regularized Markov decision processes (ER-MDP) are desirable for exploration and imitation learning applications. Motivated by the fact that such policies are sensitive to the state transition probabilities, and that estimates of these probabilities may be inaccurate, we study a robust version of the ER-MDP model, where the stochastic optimal policies are required to be robust with respect to ambiguity in the underlying transition probabilities. Our work is at the crossroads of two important schemes in reinforcement learning (RL), namely, robust MDP and entropy-regularized MDP. We show that essential properties that hold for the non-robust ER-MDP and robust unregularized MDP models also hold in our setting, making the robust ER-MDP problem tractable. We show how our framework and results can be integrated into different algorithmic schemes, including value or (modified) policy iteration, which would lead to new robust RL and inverse RL algorithms to handle uncertainties. Analyses of computational complexity and error propagation under conventional uncertainty settings are also provided.
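    For context, the non-robust ER-MDP backup replaces the max in value iteration with a temperature-scaled log-sum-exp, yielding a soft (Boltzmann) optimal policy. A minimal tabular sketch follows; the robust variant studied in the paper would additionally minimize over an ambiguity set of transition kernels inside the backup, which is omitted here.

    ```python
    import numpy as np
    from scipy.special import logsumexp

    def soft_value_iteration(P, R, gamma=0.95, tau=0.1, iters=500):
        """Entropy-regularized (soft) value iteration for a tabular MDP.

        P: (A, S, S) transition kernel, R: (S, A) rewards,
        tau: entropy-regularization temperature (tau -> 0 recovers the hard max).
        """
        A, S, _ = P.shape
        V = np.zeros(S)
        for _ in range(iters):
            Q = R + gamma * np.einsum("ast,t->sa", P, V)  # soft Bellman backup
            V = tau * logsumexp(Q / tau, axis=1)          # log-sum-exp replaces max
        pi = np.exp((Q - V[:, None]) / tau)               # soft (Boltzmann) policy
        return V, pi / pi.sum(axis=1, keepdims=True)
    ```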
    Control Theoretic Analysis of Temporal Difference Learning. (arXiv:2112.14417v2 [cs.AI] UPDATED)
    (0 min) The goal of this paper is to provide a control theoretic analysis of linear stochastic iterative algorithms and temporal difference (TD) learning. TD-learning is a linear stochastic iterative algorithm for estimating the value function of a given policy for a Markov decision process, and it is one of the most popular and fundamental reinforcement learning algorithms. While there has been a series of successful works on the theoretical analysis of TD-learning, it was not until recently that researchers found guarantees on its statistical efficiency. In this paper, we propose a control theoretic finite-time analysis of TD-learning, which exploits standard notions from the linear systems control community. The proposed work therefore provides additional insights on TD-learning and reinforcement learning using simple concepts and analysis tools from control theory.
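    To make the "linear stochastic iteration" view concrete, here is a generic TD(0) sketch with linear function approximation; env_step and phi are hypothetical stand-ins for the evaluated policy's environment interface and the feature map. In expectation the update has the affine form theta_{t+1} = theta_t + alpha (b - A theta_t), which is exactly the kind of linear system that control-theoretic tools analyze.

    ```python
    import numpy as np

    def td0_linear(env_step, phi, theta, alpha=0.05, gamma=0.99, steps=10000, s0=0):
        """TD(0) with linear function approximation: V(s) ~ phi(s) . theta.

        env_step(s) -> (reward, next_state) under the evaluated policy (assumed API).
        """
        s = s0
        for _ in range(steps):
            r, s_next = env_step(s)
            delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta  # TD error
            theta = theta + alpha * delta * phi(s)                    # stochastic update
            s = s_next
        return theta
    ```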
    Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates. (arXiv:2112.15025v1 [cs.LG])
    (0 min) We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks with no or very little new data. Specifically, we consider the framework of generalized policy evaluation and improvement, in which the rewards for all tasks of interest are assumed to be expressible as a linear combination of a fixed set of features. We show theoretically that, under certain assumptions, having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance on all possible downstream tasks which are typically more complex than the ones on which the agent was trained. Based on this theoretical analysis, we propose a simple algorithm that iteratively constructs this set of policies. In addition to empirically validating our theoretical results, we compare our approach with recently proposed diverse policy set construction methods and show that, while others fail, our approach is able to build a behavior basis that enables instantaneous transfer to all possible downstream tasks. We also show empirically that having access to a set of independent policies can better bootstrap the learning process on downstream tasks where the new reward function cannot be described as a linear combination of the features. Finally, we demonstrate that this policy set can be useful in a realistic lifelong reinforcement learning setting.
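    Under the abstract's assumption that every task reward is linear in a fixed feature map, r(s, a) = w . phi(s, a), each policy pi_i can be summarized by its successor features psi^{pi_i}, and generalized policy improvement acts greedily over all of them at once. A minimal sketch of the GPI action selection, the building block such policy-set constructions iterate on, follows; the psi tensor and w are assumed inputs.

    ```python
    import numpy as np

    def gpi_action(psi, w):
        """Generalized policy improvement with successor features.

        psi: (num_policies, num_actions, d) successor features psi^{pi_i}(s, a)
             at the current state; w: (d,) reward weights of the new task.
        Acts greedily w.r.t. the best Q-value achievable by any policy in the set.
        """
        Q = psi @ w                          # (num_policies, num_actions), Q = psi . w
        return int(Q.max(axis=0).argmax())   # max over policies, then greedy action
    ```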
    Optimal Model Averaging of Support Vector Machines in Diverging Model Spaces. (arXiv:2112.12961v2 [stat.ML] UPDATED)
    (0 min) Support vector machine (SVM) is a powerful classification method that has achieved great success in many fields. Since its performance can be seriously impaired by redundant covariates, model selection techniques are widely used for SVM with high-dimensional covariates. As an alternative to model selection, significant progress has been made in the area of model averaging over the past decades. Yet no frequentist model averaging method has been considered for SVM. This work aims to fill that gap by proposing a frequentist model averaging procedure for SVM which selects the optimal weights by cross-validation. Even when the number of covariates diverges at an exponential rate in the sample size, we show asymptotic optimality of the proposed method in the sense that the ratio of its hinge loss to the lowest possible loss converges to one. We also derive the convergence rate, which provides more insight into model averaging. Compared to model selection methods for SVM, which require the tedious but critical task of tuning parameter selection, the model averaging method avoids this task and shows promising performance in the empirical studies.
    Decentralized Optimization Over the Stiefel Manifold by an Approximate Augmented Lagrangian Function. (arXiv:2112.14949v1 [math.OC])
    (2 min) In this paper, we focus on the decentralized optimization problem over the Stiefel manifold, which is defined on a connected network of $d$ agents. The objective is an average of $d$ local functions, and each function is privately held by an agent and encodes its data. The agents can only communicate with their neighbors in a collaborative effort to solve this problem. In existing methods, multiple rounds of communication are required to guarantee convergence, giving rise to high communication costs. In contrast, this paper proposes a decentralized algorithm, called DESTINY, which invokes only a single round of communication per iteration. DESTINY combines gradient tracking techniques with a novel approximate augmented Lagrangian function. The global convergence to stationary points is rigorously established. Comprehensive numerical experiments demonstrate that DESTINY has a strong potential to deliver cutting-edge performance in solving a variety of testing problems.
    Application of Hierarchical Temporal Memory Theory for Document Categorization. (arXiv:2112.14820v1 [cs.CL])
    (2 min) The current work studies the performance of the Hierarchical Temporal Memory (HTM) theory for automated classification of text as well as documents. HTM is a biologically inspired theory based on the working principles of the human neocortex. The study provides an alternative framework for document categorization using the Spatial Pooler learning algorithm of HTM theory. As HTM accepts only a stream of binary data as input, the Latent Semantic Indexing (LSI) technique is used to extract the top features from the input and convert them into binary format. The Spatial Pooler algorithm converts the binary input into sparse patterns, with similar input text having overlapping spatial patterns, making it easy to classify the patterns into categories. The results obtained show that HTM theory, although still in its nascent stages, performs on par with most of the popular machine learning based classifiers.
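    The LSI-to-binary front end can be sketched with standard tools. Below is a minimal illustration on a toy corpus, assuming one plausible binarization (keep the top-k most active latent components per document); the actual HTM encoder and Spatial Pooler details differ.

    ```python
    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["cats purr and sleep", "dogs bark and run",
            "stock markets fell today", "bond yields rose today"]  # toy corpus

    # LSI: TF-IDF followed by truncated SVD extracts latent semantic features.
    X = TfidfVectorizer().fit_transform(docs)
    lsi = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)

    # Binarize into a sparse distributed representation: the k most active
    # latent components of each document become 1, the rest 0.
    k = 2
    sdr = np.zeros_like(lsi, dtype=int)
    top = np.argsort(-np.abs(lsi), axis=1)[:, :k]
    np.put_along_axis(sdr, top, 1, axis=1)
    print(sdr)   # overlapping rows indicate semantically similar documents
    ```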
    The SAMME.C2 algorithm for severely imbalanced multi-class classification. (arXiv:2112.14868v1 [stat.ML])
    (2 min) Classification predictive modeling involves the accurate assignment of observations in a dataset to target classes or categories. A growing number of real-world classification problems exhibit severely imbalanced class distributions, in which minority classes have far fewer observations to learn from than majority classes. Despite this sparsity, a minority class is often considered the more interesting class, yet developing a learning algorithm suitable for such observations presents countless challenges. In this article, we suggest a novel multi-class classification algorithm specialized to handle severely imbalanced classes, based on a method we refer to as SAMME.C2. It blends the flexible mechanics of the boosting techniques from the SAMME algorithm, a multi-class classifier, and the Ada.C2 algorithm, a cost-sensitive binary classifier designed to address high class imbalance. We not only provide the resulting algorithm but also establish the scientific and statistical formulation of our proposed SAMME.C2 algorithm. Through numerical experiments examining various degrees of classifier difficulty, we demonstrate consistently superior performance of our proposed model.
    A general technique for the estimation of farm animal body part weights from CT scans and its applications in a rabbit breeding program. (arXiv:2112.15095v1 [cs.CV])
    (2 min) Various applications of farm animal imaging are based on the estimation of weights of certain body parts and cuts from CT images of animals. In many cases, the complexity of the problem is increased by the enormous variability of postures in CT images due to the scanning of non-sedated, living animals. In this paper, we propose a general and robust approach for the estimation of the weights of cuts and body parts from the CT images of (possibly) living animals. We adapt multi-atlas based segmentation driven by elastic registration, and joint feature and model selection for the regression component, to cope with the large number of features and low number of samples. The proposed technique is evaluated and illustrated through real applications in rabbit breeding programs, showing r^2 scores 12% higher than the techniques previously used to drive the selection. The proposed technique is easily adaptable to similar problems; consequently, it is shared in an open source software package for the benefit of the community.
    Dimensionality reduction for prediction: Application to Bitcoin and Ethereum. (arXiv:2112.15036v1 [q-fin.ST])
    (2 min) The objective of this paper is to assess the performance of dimensionality reduction techniques in establishing a link between cryptocurrencies. We have focused our analysis on the two most traded cryptocurrencies: Bitcoin and Ethereum. To perform our analysis, we took log returns and added some covariates to build our dataset. We first used the Pearson correlation coefficient to make a preliminary assessment of the link between Bitcoin and Ethereum. We then reduced the dimension of our dataset using canonical correlation analysis and principal component analysis. After analyzing the links between Bitcoin and Ethereum with both statistical techniques, we measured their performance on forecasting Ethereum returns with Bitcoin's features.
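    All three steps map onto standard scikit-learn tools. A minimal sketch on placeholder data follows; the column names and covariates are illustrative assumptions, not the paper's exact feature set.

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.cross_decomposition import CCA
    from sklearn.decomposition import PCA

    # Hypothetical log returns plus covariates (e.g., volatility, range, volume).
    btc = pd.DataFrame(np.random.default_rng(0).normal(size=(500, 4)),
                       columns=["ret", "vol", "range", "volume"])
    eth = pd.DataFrame(np.random.default_rng(1).normal(size=(500, 4)),
                       columns=["ret", "vol", "range", "volume"])

    # 1) Preliminary link: Pearson correlation between the two return series.
    print(np.corrcoef(btc["ret"], eth["ret"])[0, 1])

    # 2) PCA: reduce each feature set separately to its leading components.
    btc_pc = PCA(n_components=2).fit_transform(btc)

    # 3) CCA: find the pair of projections maximizing cross-asset correlation.
    cca = CCA(n_components=2).fit(btc, eth)
    btc_c, eth_c = cca.transform(btc, eth)
    ```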
    Aim in Climate Change and City Pollution. (arXiv:2112.15115v1 [cs.LG])
    (2 min) The sustainability of urban environments is an increasingly relevant problem. Air pollution plays a key role in the degradation of the environment as well as in the health of the citizens exposed to it. In this chapter we provide a review of the methods available to model air pollution, focusing on the application of machine-learning methods. In fact, machine-learning methods have been shown to substantially increase the accuracy of traditional air-pollution approaches while limiting the development cost of the models. Machine-learning tools have opened new approaches to studying air pollution, such as flow-dynamics modelling or remote-sensing methodologies.
    Are we really making much progress? Revisiting, benchmarking, and refining heterogeneous graph neural networks. (arXiv:2112.14936v1 [cs.LG])
    (2 min) Heterogeneous graph neural networks (HGNNs) have been blossoming in recent years, but the unique data processing and evaluation setups used by each work obstruct a full understanding of their advancements. In this work, we present a systematic reproduction of 12 recent HGNNs by using their official codes, datasets, settings, and hyperparameters, revealing surprising findings about the progress of HGNNs. We find that simple homogeneous GNNs, e.g., GCN and GAT, are largely underestimated due to improper settings. GAT with proper inputs can generally match or outperform all existing HGNNs across various scenarios. To facilitate robust and reproducible HGNN research, we construct the Heterogeneous Graph Benchmark (HGB), consisting of 11 diverse datasets with three tasks. HGB standardizes the process of heterogeneous graph data splits, feature processing, and performance evaluation. Finally, we introduce a simple but very strong baseline, Simple-HGN--which significantly outperforms all previous models on HGB--to accelerate the advancement of HGNNs in the future.
    Two Instances of Interpretable Neural Network for Universal Approximations. (arXiv:2112.15026v1 [cs.LG])
    (2 min) This paper proposes two bottom-up interpretable neural network (NN) constructions for universal approximation, namely the Triangularly-constructed NN (TNN) and the Semi-Quantized Activation NN (SQANN). The notable properties are (1) resistance to catastrophic forgetting, (2) the existence of a proof for arbitrarily high accuracy on the training dataset, and (3) for an input \(x\), users can identify specific samples of training data whose activation "fingerprints" are similar to those of \(x\). Users can also identify samples that are out of distribution.
    Binary Diffing as a Network Alignment Problem via Belief Propagation. (arXiv:2112.15337v1 [cs.LG])
    (2 min) In this paper, we address the problem of finding a correspondence, or matching, between the functions of two programs in binary form, which is one of the most common tasks in binary diffing. We introduce a new formulation of this problem as a particular instance of a graph edit problem over the call graphs of the programs. In this formulation, the quality of a mapping is evaluated simultaneously with respect to both function content and call graph similarities. We show that this formulation is equivalent to a network alignment problem. We propose a solving strategy for this problem based on max-product belief propagation. Finally, we implement a prototype of our method, called QBinDiff, and propose an extensive evaluation which shows that our approach outperforms state-of-the-art diffing tools.
    Training Quantized Deep Neural Networks via Cooperative Coevolution. (arXiv:2112.14834v1 [cs.NE])
    (2 min) Quantizing deep neural networks (DNNs) has been a promising solution for deploying them on embedded devices. However, most of the existing methods do not quantize gradients, and the process of quantizing DNNs still involves many floating-point operations, which hinders further applications of quantized DNNs. To solve this problem, we propose a new heuristic method based on cooperative coevolution for quantizing DNNs. Under the framework of cooperative coevolution, we use an estimation of distribution algorithm to search for low-bit weights. Specifically, we first construct an initial quantized network from a pre-trained network instead of random initialization and then start searching from it by restricting the search space. To date, this is the largest discrete problem known to be solved by evolutionary algorithms. Experiments show that our method can train a 4-bit ResNet-20 on the CIFAR-10 dataset without sacrificing accuracy.
    InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer. (arXiv:2112.15320v1 [cs.LG])
    (2 min) Many social media users prefer consuming content in the form of videos rather than text. However, in order for content creators to produce videos with a high click-through rate, much editing is needed to match the footage to the music. This poses additional challenges for amateur video makers. Therefore, we propose a novel attention-based model, VMT (Video-Music Transformer), that automatically generates piano scores from video frames. Using music generated from models also prevents potential copyright infringements that often come with using existing music. To the best of our knowledge, there is no work besides the proposed VMT that aims to compose music for video. Additionally, there is no existing dataset with aligned video and symbolic music. We release a new dataset composed of over 7 hours of piano scores with fine alignment between pop music videos and MIDI files. We conduct experiments with human evaluation on VMT, a Seq2Seq model (our baseline), and the original piano-version soundtrack. VMT achieves consistent improvements over the baseline on music smoothness and video relevance. In particular, with the relevance scores and our case study, our model demonstrates multimodal capability, responding to frame-level actors' movement when generating music. Our VMT model, along with the new dataset, presents a promising research direction toward composing matching soundtracks for videos. We have released our code at https://github.com/linchintung/VMT
    SimSR: Simple Distance-based State Representation for Deep Reinforcement Learning. (arXiv:2112.15303v1 [cs.LG])
    (2 min) This work explores how to learn robust and generalizable state representations from image-based observations with deep reinforcement learning methods. Addressing the computational complexity, stringent assumptions, and representation collapse challenges in existing work on bisimulation metrics, we devise the Simple State Representation (SimSR) operator, which achieves equivalent functionality while reducing the complexity by an order of magnitude compared with the bisimulation metric. SimSR enables us to design a stochastic-approximation-based method that can practically learn the mapping functions (encoders) from observations to a latent representation space. Besides the theoretical analysis, we compared our method with recent state-of-the-art solutions on visual MuJoCo tasks. The results show that our model generally achieves better performance and exhibits better robustness and good generalization.
    Learning Inception Attention for Image Synthesis and Image Recognition. (arXiv:2112.14804v1 [cs.CV])
    (2 min) Image synthesis and image recognition have witnessed remarkable progress, but often at the expense of computationally expensive training and inference. Learning lightweight yet expressive deep models has emerged as an important and interesting direction. Inspired by the well-known split-transform-aggregate design heuristic of the Inception building block, this paper proposes a Skip-Layer Inception Module (SLIM) that facilitates efficient learning of image synthesis models, and a same-layer variant (also dubbed SLIM) as a stronger alternative to the well-known ResNeXts for image recognition. In SLIM, the input feature map is first split into a number of groups (e.g., 4). Each group is then transformed to a latent style vector (via channel-wise attention) and a latent spatial mask (via spatial attention). The learned latent masks and latent style vectors are aggregated to modulate the target feature map. For generative learning, SLIM is built on the recently proposed lightweight Generative Adversarial Networks (i.e., FastGANs), which feature a skip-layer excitation (SLE) module. For few-shot image synthesis tasks, the proposed SLIM achieves better performance than the SLE work and other related methods. For one-shot image synthesis tasks, it shows a stronger capability of preserving image structures than prior arts such as the SinGANs. For image classification tasks, the proposed SLIM is used as a drop-in replacement for convolution layers in ResNets (resulting in ResNeXt-like models) and achieves better accuracy on the ImageNet-1000 dataset, with significantly smaller model complexity.
    Resource-Efficient Deep Learning: A Survey on Model-, Arithmetic-, and Implementation-Level Techniques. (arXiv:2112.15131v1 [cs.LG])
    (2 min) Deep learning is pervasive in our daily life, including self-driving cars, virtual assistants, social network services, healthcare services, face recognition, etc. However, deep neural networks demand substantial compute resources during training and inference. The machine learning community has mainly focused on model-level optimizations such as architectural compression of deep learning models, while the systems community has focused on implementation-level optimization. In between, various arithmetic-level optimization techniques have been proposed in the arithmetic community. This article provides a survey of resource-efficient deep learning techniques at the model, arithmetic, and implementation levels, and identifies the research gaps across these three levels. Our survey clarifies the influence of higher-level on lower-level techniques based on our resource-efficiency metric definition and discusses future trends for resource-efficient deep learning research.
    On Distinctive Properties of Universal Perturbations. (arXiv:2112.15329v1 [cs.LG])
    (2 min) We identify properties of universal adversarial perturbations (UAPs) that distinguish them from standard adversarial perturbations. Specifically, we show that targeted UAPs generated by projected gradient descent exhibit two human-aligned properties: semantic locality and spatial invariance, which standard targeted adversarial perturbations lack. We also demonstrate that UAPs contain significantly less signal for generalization than standard adversarial perturbations -- that is, UAPs leverage non-robust features to a smaller extent than standard adversarial perturbations.
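    As a point of reference, a targeted UAP of the kind analyzed here is typically generated by optimizing a single perturbation over many inputs with projected gradient descent. Below is a minimal PGD sketch, assuming a 224x224 RGB classifier and a standard PyTorch data loader; the epsilon, step size, and epoch count are illustrative choices.

    ```python
    import torch
    import torch.nn.functional as F

    def targeted_uap(model, loader, target, eps=10/255, alpha=1/255, epochs=5):
        """Sketch: one targeted universal perturbation via projected gradient descent.

        delta is shared across all inputs and constrained to ||delta||_inf <= eps;
        the loss pushes model(x + delta) toward the chosen target class.
        """
        model.eval()
        delta = torch.zeros(1, 3, 224, 224, requires_grad=True)  # assumed input shape
        for _ in range(epochs):
            for x, _ in loader:
                loss = F.cross_entropy(
                    model(x + delta),
                    torch.full((x.size(0),), target, dtype=torch.long))
                loss.backward()
                with torch.no_grad():
                    delta -= alpha * delta.grad.sign()  # descend the targeted loss
                    delta.clamp_(-eps, eps)             # project onto the L_inf ball
                delta.grad.zero_()
        return delta.detach()
    ```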
    Pose Estimation of Specific Rigid Objects. (arXiv:2112.15075v1 [cs.CV])
    (3 min) In this thesis, we address the problem of estimating the 6D pose of rigid objects from a single RGB or RGB-D input image, assuming that 3D models of the objects are available. This problem is of great importance to many application fields such as robotic manipulation, augmented reality, and autonomous driving. First, we propose EPOS, a method for 6D object pose estimation from an RGB image. The key idea is to represent an object by compact surface fragments and predict the probability distribution of corresponding fragments at each pixel of the input image by a neural network. Each pixel is linked with a data-dependent number of fragments, which allows systematic handling of symmetries, and the 6D poses are estimated from the links by a RANSAC-based fitting method. EPOS outperformed all RGB and most RGB-D and D methods on several standard datasets. Second, we present HashMatch, an RGB-D method that slides a window over the input image and searches for a match against templates, which are pre-generated by rendering 3D object models in different orientations. The method applies a cascade of evaluation stages to each window location, which avoids exhaustive matching against all templates. Third, we propose ObjectSynth, an approach to synthesize photorealistic images of 3D object models for training methods based on neural networks. The images yield substantial improvements compared to commonly used images of objects rendered on top of random photographs. Fourth, we introduce T-LESS, the first dataset for 6D object pose estimation that includes 3D models and RGB-D images of industry-relevant objects. Fifth, we define BOP, a benchmark that captures the status quo in the field. BOP comprises eleven datasets in a unified format, an evaluation methodology, an online evaluation system, and public challenges held at international workshops organized at the ICCV and ECCV conferences.
    A Survey of Deep Learning Techniques for Dynamic Branch Prediction. (arXiv:2112.14911v1 [cs.AR])
    (2 min) Branch prediction is an architectural feature that speeds up the execution of branch instructions on pipelined processors and reduces the cost of branching. Recent advancements in Deep Learning (DL) in the post Moore's Law era are accelerating areas such as automated chip design, low-power computer architectures, and much more. Traditional computer architecture designs and algorithms could benefit from dynamic predictors based on deep learning algorithms, which learn from experience by optimizing their parameters on large amounts of data. In this survey paper, we focus on traditional branch prediction algorithms, analyze their limitations, and present a literature survey of how deep learning techniques can be applied to create dynamic branch predictors capable of predicting conditional branch instructions. Prior surveys in this field focus on dynamic branch prediction techniques based on neural network perceptrons. We plan to extend the survey based on the latest research in DL and advanced Machine Learning (ML) based branch predictors.
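    The perceptron predictor that prior surveys center on is small enough to sketch in full. Below is a toy version in the style of Jimenez and Lin's design: one perceptron per hashed branch address, global history bits encoded as +1/-1 inputs, and training on a misprediction or a low-margin output. The table size, history length, and threshold constant are conventional illustrative choices.

    ```python
    class PerceptronBranchPredictor:
        """Toy perceptron branch predictor (Jimenez & Lin style)."""

        def __init__(self, num_perceptrons=1024, history_len=16):
            self.n = num_perceptrons
            self.w = [[0] * (history_len + 1) for _ in range(num_perceptrons)]
            self.history = [1] * history_len            # global history, +1/-1
            self.theta = int(1.93 * history_len + 14)   # standard training threshold

        def predict(self, pc):
            w = self.w[pc % self.n]
            y = w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))
            return y, y >= 0                            # (raw output, predicted taken?)

        def update(self, pc, taken):
            y, pred = self.predict(pc)
            t = 1 if taken else -1
            if (pred != taken) or abs(y) <= self.theta:  # train on mispredict or low margin
                w = self.w[pc % self.n]
                w[0] += t
                for i, hi in enumerate(self.history):
                    w[i + 1] += t * hi
            self.history = self.history[1:] + [t]        # shift in the new outcome
    ```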
    Bayesian Algorithms Learn to Stabilize Unknown Continuous-Time Systems. (arXiv:2112.15094v1 [eess.SY])
    (2 min) Linear dynamical systems are canonical models for learning-based control of plants with uncertain dynamics. The setting consists of a stochastic differential equation that captures the state evolution of the plant under study, while the true dynamics matrices are unknown and need to be learned from the observed state trajectory data. An important issue is to ensure that the system is stabilized and that destabilizing control actions due to model uncertainties are precluded as soon as possible. A reliable stabilization procedure that can effectively learn from unstable data to stabilize the system in finite time is not currently available. In this work, we propose a novel Bayesian learning algorithm that stabilizes unknown continuous-time stochastic linear systems. The presented algorithm is flexible and exhibits effective stabilization performance after a remarkably short period of interacting with the system.
  • stat.ML updates on arXiv.org

    Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation. (arXiv:2106.11612v2 [cs.LG] UPDATED)
    (2 min) We study reinforcement learning (RL) with linear function approximation. Existing algorithms for this problem only have high-probability regret and/or Probably Approximately Correct (PAC) sample complexity guarantees, which cannot guarantee the convergence to the optimal policy. In this paper, in order to overcome the limitation of existing algorithms, we propose a new algorithm called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with high probability. The uniform-PAC guarantee is the strongest possible guarantee for reinforcement learning in the literature, which can directly imply both PAC and high probability regret bounds, making our algorithm superior to all existing algorithms with linear function approximation. At the core of our algorithm is a novel minimax value function estimator and a multi-level partition scheme to select the training samples from historical observations. Both of these techniques are new and of independent interest.
    Bounding Wasserstein distance with couplings. (arXiv:2112.03152v2 [stat.CO] CROSS LISTED)
    (2 min) Markov chain Monte Carlo (MCMC) provides asymptotically consistent estimates of intractable posterior expectations as the number of iterations tends to infinity. However, in large data applications, MCMC can be computationally expensive per iteration. This has catalyzed interest in sampling methods such as approximate MCMC, which trade off asymptotic consistency for improved computational speed. In this article, we propose estimators based on couplings of Markov chains to assess the quality of such asymptotically biased sampling methods. The estimators give empirical upper bounds of the Wasserstein distance between the limiting distribution of the asymptotically biased sampling method and the original target distribution of interest. We establish theoretical guarantees for our upper bounds and show that our estimators can remain effective in high dimensions. We apply our quality measures to stochastic gradient MCMC, variational Bayes, and Laplace approximations for tall data and to approximate MCMC for Bayesian logistic regression in 4500 dimensions and Bayesian linear regression in 50000 dimensions.
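    The principle behind such estimators is that for any coupling (X, Y) of distributions p and q, W1(p, q) <= E||X - Y||, so averaging distances over coupled draws yields an empirical upper bound. A minimal one-dimensional sketch with a common-random-numbers coupling follows; the biased Gaussian sampler is a stand-in for the limit of an approximate MCMC scheme, whereas the paper couples actual Markov chains.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def coupled_pair():
        """Draw (X, Y) from a common-random-numbers coupling.

        X ~ exact target N(0, 1); Y ~ a biased approximation N(0.1, 1.1^2)
        (a stand-in for the limiting law of an approximate sampler).
        """
        z = rng.normal()            # shared randomness couples the two draws
        return z, 0.1 + 1.1 * z

    pairs = np.array([coupled_pair() for _ in range(100_000)])
    upper = np.abs(pairs[:, 0] - pairs[:, 1]).mean()
    # W1(p, q) <= E|X - Y| holds for *any* coupling of p and q.
    print(f"empirical upper bound on W1: {upper:.4f}")
    ```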
    Sparse Generalized Yule-Walker Estimation for Large Spatio-temporal Autoregressions with an Application to NO2 Satellite Data. (arXiv:2108.02864v2 [econ.EM] UPDATED)
    (2 min) We consider a high-dimensional model in which variables are observed over time and space. The model consists of a spatio-temporal regression containing a time lag and a spatial lag of the dependent variable. Unlike classical spatial autoregressive models, we do not rely on a predetermined spatial interaction matrix, but infer all spatial interactions from the data. Assuming sparsity, we estimate the spatial and temporal dependence in a fully data-driven manner by penalizing a set of Yule-Walker equations. This regularization can be left unstructured, but we also propose customized shrinkage procedures when observations originate from spatial grids (e.g. satellite images). Finite sample error bounds are derived and estimation consistency is established in an asymptotic framework wherein the sample size and the number of spatial units diverge jointly. Exogenous variables can be included as well. A simulation exercise shows strong finite sample performance compared to competing procedures. As an empirical application, we model satellite measured NO2 concentrations in London. Our approach delivers forecast improvements over a competitive benchmark and we discover evidence for strong spatial interactions.
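    The penalized Yule-Walker idea can be sketched for a plain sparse VAR(1), X_t = A X_{t-1} + eps_t: the relation Gamma(1) = A Gamma(0) is solved row by row under an l1 penalty. This is a simplification assuming no separate spatial lag term and unstructured shrinkage, unlike the paper's full model.

    ```python
    import numpy as np
    from sklearn.linear_model import Lasso

    def sparse_yule_walker(X, lam=0.05):
        """Lasso-penalized Yule-Walker estimation of a sparse VAR(1) matrix A.

        X: (T, N) observations over T time points and N spatial units.
        """
        T, N = X.shape
        Xc = X - X.mean(axis=0)
        G0 = Xc.T @ Xc / T                     # lag-0 autocovariance, (N, N)
        G1 = Xc[1:].T @ Xc[:-1] / (T - 1)      # lag-1 autocovariance, (N, N)
        A = np.zeros((N, N))
        for i in range(N):
            # Row i of A solves G1[i, :] ~ A[i, :] @ G0 under an l1 penalty
            # (G0 is symmetric, so it can serve directly as the design matrix).
            A[i] = Lasso(alpha=lam, fit_intercept=False).fit(G0, G1[i]).coef_
        return A
    ```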
    Towards the global vision of engagement of Generation Z at the workplace: Mathematical modeling. (arXiv:2112.15401v1 [econ.GN])
    (2 min) Correlation and cluster analyses (k-Means, Gaussian Mixture Models) were performed on Generation Z engagement surveys at the workplace. The clustering indicates relations between the various factors that describe the engagement of employees. The most noticeable factors are a clear statement of responsibilities at work and challenging work. These factors are essential in practice. The results of this paper can be used to prepare better motivational systems aimed at Generation Z employees.
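    Both clustering methods named in the abstract are available in scikit-learn. A minimal sketch on placeholder survey data follows; the Likert-scale items and the choice of three clusters are illustrative assumptions.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture
    from sklearn.preprocessing import StandardScaler

    # Hypothetical engagement survey: rows = respondents, columns = Likert items
    # (e.g., "clear responsibilities", "challenging work", ...).
    responses = np.random.default_rng(0).integers(1, 6, size=(300, 10)).astype(float)
    Z = StandardScaler().fit_transform(responses)

    km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
    gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(Z)

    # Inspect cluster means to see which engagement factors separate the groups.
    for c in range(3):
        print(c, Z[km_labels == c].mean(axis=0).round(2))
    ```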
    A theory of independent mechanisms for extrapolation in generative models. (arXiv:2004.00184v2 [cs.LG] UPDATED)
    (2 min) Generative models can be trained to emulate complex empirical data, but are they useful for making predictions in the context of previously unobserved environments? An intuitive idea to promote such extrapolation capabilities is to have the architecture of such a model reflect a causal graph of the true data generating process, such that one can intervene on each node independently of the others. However, the nodes of this graph are usually unobserved, leading to overparameterization and lack of identifiability of the causal structure. We develop a theoretical framework to address this challenging situation by defining a weaker form of identifiability, based on the principle of independence of mechanisms. We demonstrate on toy examples that classical stochastic gradient descent can hinder the model's extrapolation capabilities, suggesting independence of mechanisms should be enforced explicitly during training. Experiments on deep generative models trained on real world data support these insights and illustrate how the extrapolation capabilities of such models can be leveraged.
    Score-Based Generative Modeling with Critically-Damped Langevin Diffusion. (arXiv:2112.07068v2 [stat.ML] UPDATED)
    (2 min) Score-based generative models (SGMs) have demonstrated remarkable synthesis quality. SGMs rely on a diffusion process that gradually perturbs the data towards a tractable distribution, while the generative model learns to denoise. The complexity of this denoising task is, apart from the data distribution itself, uniquely determined by the diffusion process. We argue that current SGMs employ overly simplistic diffusions, leading to unnecessarily complex denoising processes, which limit generative modeling performance. Based on connections to statistical mechanics, we propose a novel critically-damped Langevin diffusion (CLD) and show that CLD-based SGMs achieve superior performance. CLD can be interpreted as running a joint diffusion in an extended space, where the auxiliary variables can be considered "velocities" that are coupled to the data variables as in Hamiltonian dynamics. We derive a novel score matching objective for CLD and show that the model only needs to learn the score function of the conditional distribution of the velocity given data, an easier task than learning scores of the data directly. We also derive a new sampling scheme for efficient synthesis from CLD-based diffusion models. We find that CLD outperforms previous SGMs in synthesis quality for similar network architectures and sampling compute budgets. We show that our novel sampler for CLD significantly outperforms solvers such as Euler--Maruyama. Our framework provides new insights into score-based denoising diffusion models and can be readily used for high-resolution image synthesis. Project page and code: https://nv-tlabs.github.io/CLD-SGM.
    When are Iterative Gaussian Processes Reliably Accurate?. (arXiv:2112.15246v1 [cs.LG])
    (2 min) While recent work on conjugate gradient methods and Lanczos decompositions has achieved scalable Gaussian process inference with highly accurate point predictions, in several implementations these iterative methods appear to struggle with numerical instabilities when learning kernel hyperparameters, and with poor test likelihoods. By investigating CG tolerance, preconditioner rank, and Lanczos decomposition rank, we provide a particularly simple prescription to correct these issues: we recommend that one should use a small CG tolerance ($\epsilon \leq 0.01$) and a large root decomposition size ($r \geq 5000$). Moreover, we show that L-BFGS-B is a compelling optimizer for Iterative GPs, achieving convergence with fewer gradient updates.
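    To see why L-BFGS-B is a natural fit for GP hyperparameter learning, here is a minimal exact-GP sketch in which scipy's L-BFGS-B minimizes the negative log marginal likelihood of an RBF kernel. The paper's recommendations concern iterative (CG/Lanczos-based) GP implementations, so this Cholesky-based version is a simplified stand-in.

    ```python
    import numpy as np
    from scipy.linalg import cho_factor, cho_solve
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(100, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

    def neg_log_marginal_likelihood(log_params):
        ls, sf, sn = np.exp(log_params)          # lengthscale, signal std, noise std
        d2 = (X - X.T) ** 2                      # squared distances (1-d inputs)
        K = sf**2 * np.exp(-0.5 * d2 / ls**2) + sn**2 * np.eye(len(X))
        L = cho_factor(K, lower=True)
        alpha = cho_solve(L, y)
        # 0.5 * y^T K^{-1} y + 0.5 * log|K| + (n/2) * log(2*pi)
        return (0.5 * y @ alpha + np.log(np.diag(L[0])).sum()
                + 0.5 * len(X) * np.log(2 * np.pi))

    res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), method="L-BFGS-B")
    print(np.exp(res.x))   # learned (lengthscale, signal std, noise std)
    ```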
    High Dimensional Optimization through the Lens of Machine Learning. (arXiv:2112.15392v1 [math.OC])
    (2 min) This thesis reviews numerical optimization methods with machine learning problems in mind. Since machine learning models are highly parametrized, we focus on methods suited for high dimensional optimization. We build intuition on quadratic models to figure out which methods are suited for non-convex optimization, and develop convergence proofs on convex functions for this selection of methods. With this theoretical foundation for stochastic gradient descent and momentum methods, we try to explain why the methods commonly used in the machine learning field are so successful. Besides explaining successful heuristics, the last chapter also provides a less extensive review of more theoretical methods, which are not quite as popular in practice. So in some sense this work attempts to answer the question: why are TensorFlow's default optimizers the defaults?
    Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation. (arXiv:2110.06394v2 [cs.LG] UPDATED)
    (2 min) We study the model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs). In this setting, the agent works in two phases. In the exploration phase, the agent interacts with the environment and collects samples without the reward. In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy. We propose a new provably efficient algorithm, called UCRL-RFE under the Linear Mixture MDP assumption, where the transition probability kernel of the MDP can be parameterized by a linear function over certain feature mappings defined on the triplet of state, action, and next state. We show that to obtain an $\epsilon$-optimal policy for arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{\mathcal{O}}(H^5d^2\epsilon^{-2})$ episodes during the exploration phase. Here, $H$ is the length of the episode, $d$ is the dimension of the feature mapping. We also propose a variant of UCRL-RFE using Bernstein-type bonus and show that it needs to sample at most $\tilde{\mathcal{O}}(H^4d(H + d)\epsilon^{-2})$ to achieve an $\epsilon$-optimal policy. By constructing a special class of linear Mixture MDPs, we also prove that for any reward-free algorithm, it needs to sample at least $\tilde \Omega(H^2d\epsilon^{-2})$ episodes to obtain an $\epsilon$-optimal policy. Our upper bound matches the lower bound in terms of the dependence on $\epsilon$ and the dependence on $d$ if $H \ge d$.
    Infinite wide (finite depth) Neural Networks benefit from multi-task learning unlike shallow Gaussian Processes -- an exact quantitative macroscopic characterization. (arXiv:2112.15577v1 [cs.LG])
    (2 min) We prove in this paper that wide ReLU neural networks (NNs) with at least one hidden layer, optimized with l2-regularization on the parameters, enforce multi-task learning due to representation learning - even in the infinite-width limit. This is in contrast to multiple other idealized settings discussed in the literature, where wide (ReLU-)NNs lose their ability to benefit from multi-task learning in the infinite-width limit. We deduce the multi-task learning ability by proving an exact quantitative macroscopic characterization of the learned NN in function space.
    Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework. (arXiv:2002.01711v5 [cs.LG] UPDATED)
    (2 min) A/B testing, or online experimentation, is a standard business strategy to compare a new product with an old one in pharmaceutical, technological, and traditional industries. Major challenges arise in online experiments on two-sided marketplace platforms (e.g., Uber), where there is only one unit that receives a sequence of treatments over time. In these experiments, the treatment at a given time impacts the current outcome as well as future outcomes. The aim of this paper is to introduce a reinforcement learning framework for carrying out A/B testing in such experiments while characterizing the long-term treatment effects. Our proposed testing procedure allows for sequential monitoring and online updating. It is generally applicable to a variety of treatment designs in different industries. In addition, we systematically investigate the theoretical properties (e.g., size and power) of our testing procedure. Finally, we apply our framework to both simulated data and a real-world data example obtained from a technology company to illustrate its advantage over current practice. A Python implementation of our test is available at https://github.com/callmespring/CausalRL
    G-PATE: Scalable Differentially Private Data Generator via Private Aggregation of Teacher Discriminators. (arXiv:1906.09338v2 [cs.LG] UPDATED)
    (2 min) Recent advances in machine learning have largely benefited from the massive accessible training data. However, large-scale data sharing has raised great privacy concerns. In this work, we propose a novel privacy-preserving data Generative model based on the PATE framework (G-PATE), aiming to train a scalable differentially private data generator that preserves high generated data utility. Our approach leverages generative adversarial nets to generate data, combined with private aggregation among different discriminators to ensure strong privacy guarantees. Compared to existing approaches, G-PATE significantly improves the use of privacy budgets. In particular, we train a student data generator with an ensemble of teacher discriminators and propose a novel private gradient aggregation mechanism to ensure differential privacy on all information that flows from teacher discriminators to the student generator. In addition, with random projection and gradient discretization, the proposed gradient aggregation mechanism is able to effectively deal with high-dimensional gradient vectors. Theoretically, we prove that G-PATE ensures differential privacy for the data generator. Empirically, we demonstrate the superiority of G-PATE over prior work through extensive experiments. We show that G-PATE is the first work being able to generate high-dimensional image data with high data utility under limited privacy budgets ($\epsilon \le 1$). Our code is available at https://github.com/AI-secure/G-PATE.
    Modelling matrix time series via a tensor CP-decomposition. (arXiv:2112.15423v1 [stat.ME])
    (2 min) We propose to model matrix time series based on a tensor CP-decomposition. Instead of using an iterative algorithm, which is the standard practice for estimating CP-decompositions, we propose a new one-pass estimation procedure based on a generalized eigenanalysis constructed from the serial dependence structure of the underlying process. A key idea of the new procedure is to project a generalized eigenequation defined in terms of rank-reduced matrices onto a lower-dimensional one with full-rank matrices, to avoid the intricacy of the former, for which the number of eigenvalues can be zero, finite, or infinite. The asymptotic theory is established under a general setting without stationarity. It shows, for example, that all the component coefficient vectors in the CP-decomposition are estimated consistently, with different error rates depending on the relative sizes of the dimensions of the time series and the sample size. The proposed model and the estimation method are further illustrated with both simulated and real data, showing effective dimension reduction in modelling and forecasting matrix time series.
    Efficient Robust Training via Backward Smoothing. (arXiv:2010.01278v2 [cs.LG] UPDATED)
    (2 min) Adversarial training is so far the most effective strategy in defending against adversarial examples. However, it suffers from high computational costs due to the iterative adversarial attacks in each training step. Recent studies show that it is possible to achieve fast Adversarial Training by performing a single-step attack with random initialization. However, such an approach still lags behind state-of-the-art adversarial training algorithms on both stability and model robustness. In this work, we develop a new understanding towards Fast Adversarial Training, by viewing random initialization as performing randomized smoothing for better optimization of the inner maximization problem. Following this new perspective, we also propose a new initialization strategy, backward smoothing, to further improve the stability and model robustness over single-step robust training methods. Experiments on multiple benchmarks demonstrate that our method achieves similar model robustness as the original TRADES method while using much less training time ($\sim$3x improvement with the same training schedule).
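    For context, the single-step baseline this paper improves on (FGSM with random initialization, the "random initialization as smoothing" view) can be sketched as one training step. This shows the baseline, not the proposed backward-smoothing method; the epsilon and step size follow common CIFAR-style conventions and are assumptions.

    ```python
    import torch
    import torch.nn.functional as F

    def fgsm_rs_step(model, x, y, optimizer, eps=8/255, alpha=10/255):
        """One step of fast adversarial training: a single FGSM step started
        from a random point inside the eps-ball."""
        delta = torch.empty_like(x).uniform_(-eps, eps)   # random initialization
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta = (delta + alpha * delta.grad.sign()).clamp(-eps, eps)  # one FGSM step
        optimizer.zero_grad()                             # clear grads from the attack pass
        adv_loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
        adv_loss.backward()
        optimizer.step()
        return adv_loss.item()
    ```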
    Benign Overfitting in Adversarially Robust Linear Classification. (arXiv:2112.15250v1 [cs.LG])
    (2 min) "Benign overfitting", where classifiers memorize noisy training data yet still achieve a good generalization performance, has drawn great attention in the machine learning community. To explain this surprising phenomenon, a series of works have provided theoretical justification in over-parameterized linear regression, classification, and kernel methods. However, it is not clear if benign overfitting still occurs in the presence of adversarial examples, i.e., examples with tiny and intentional perturbations to fool the classifiers. In this paper, we show that benign overfitting indeed occurs in adversarial training, a principled approach to defend against adversarial examples. In detail, we prove the risk bounds of the adversarially trained linear classifier on the mixture of sub-Gaussian data under $\ell_p$ adversarial perturbations. Our result suggests that under moderate perturbations, adversarially trained linear classifiers can achieve the near-optimal standard and adversarial risks, despite overfitting the noisy training data. Numerical experiments validate our theoretical findings.
    Data-driven advice for interpreting local and global model predictions in bioinformatics problems. (arXiv:2108.06201v2 [stat.ML] UPDATED)
    (2 min) Tree-based algorithms such as random forests and gradient boosted trees continue to be among the most popular and powerful machine learning models used across multiple disciplines. The conventional wisdom for estimating the impact of a feature in tree-based models is to measure the node-wise reduction of a loss function, which (i) yields only global importance measures and (ii) is known to suffer from severe biases. Conditional feature contributions (CFCs) provide local, case-by-case explanations of a prediction by following the decision path and attributing changes in the expected output of the model to each feature along the path. However, Lundberg et al. pointed out a potential bias of CFCs which depends on the distance from the root of a tree. The by now immensely popular alternative, SHapley Additive exPlanation (SHAP) values, appears to mitigate this bias but is computationally much more expensive. Here we contribute a thorough comparison of the explanations computed by both methods on a set of 164 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers. For random forests, we find extremely high similarities and correlations of both local and global SHAP values and CFC scores, leading to very similar rankings and interpretations. Analogous conclusions hold for the fidelity of using global feature importance scores as a proxy for the predictive power associated with each feature.
    From Twitter to Traffic Predictor: Next-Day Morning Traffic Prediction Using Social Media Data. (arXiv:2009.13794v3 [cs.SI] UPDATED)
    (3 min) The effectiveness of traditional traffic prediction methods is often extremely limited when forecasting traffic dynamics in the early morning. The reason is that traffic can break down drastically during the early morning commute, and the time and duration of this breakdown vary substantially from day to day. Early morning traffic forecasts are crucial for informing morning-commute traffic management, but they are generally challenging to make in advance, particularly by midnight. In this paper, we propose to mine Twitter messages as a probe for understanding how people's work and rest patterns in the evening/midnight of the previous day affect next-day morning traffic. The model is tested on freeway networks in Pittsburgh. The resulting relationship is surprisingly simple and powerful. We find that, in general, the earlier people rest, as indicated by their tweets, the more congested roads will be the next morning. The occurrence of big events the evening before, represented by tweet sentiment higher or lower than normal, often implies lower travel demand the next morning than on normal days. Besides, people's tweeting activities in the night before and in the early morning are statistically associated with congestion in the morning peak hours. We make use of such relationships to build a predictive framework which forecasts morning commute congestion using people's tweeting profiles extracted by 5 am, or even by the midnight of the previous day. The Pittsburgh study supports that our framework can precisely predict morning congestion, particularly for some road segments upstream of roadway bottlenecks with large day-to-day congestion variation. Our approach considerably outperforms existing methods without Twitter message features, and it can learn meaningful representations of demand from tweeting profiles that offer managerial insights.
    Random cohort effects and age groups dependency structure for mortality modelling and forecasting: Mixed-effects time-series model approach. (arXiv:2112.15258v1 [stat.AP])
    (2 min) Significant effort has been devoted to managing longevity risk, as continued population ageing has become a severe issue for many developed countries over the past few decades. The Cairns-Blake-Dowd (CBD) model, which incorporates cohort-effect parameters in its parsimonious design, is one of the best-known approaches for modelling mortality at higher ages and longevity risk. This article proposes a novel mixed-effects time-series approach to mortality modelling and forecasting that accounts for dependence between age groups and treats the cohort-effect parameters as random. The proposed model can extract more information from mortality data and provides a natural quantification of parameter uncertainty, with no pre-specified constraint required for estimating the cohort-effect parameters. Its abilities are demonstrated through two applications to empirical male and female mortality data. In the numerical examples, the proposed approach shows remarkable improvements in forecast accuracy over the CBD model in short-, mid- and long-term forecasting using mortality data from several developed countries.
    Binary Diffing as a Network Alignment Problem via Belief Propagation. (arXiv:2112.15337v1 [cs.LG])
    (2 min) In this paper, we address the problem of finding a correspondence, or matching, between the functions of two programs in binary form, which is one of the most common tasks in binary diffing. We introduce a new formulation of this problem as a particular instance of a graph edit problem over the call graphs of the programs. In this formulation, the quality of a mapping is evaluated simultaneously with respect to both function content and call-graph similarity. We show that this formulation is equivalent to a network alignment problem. We propose a solving strategy for this problem based on max-product belief propagation. Finally, we implement a prototype of our method, called QBinDiff, and present an extensive evaluation which shows that our approach outperforms state-of-the-art diffing tools.
    Multiple Testing and Variable Selection along the path of the Least Angle Regression. (arXiv:1906.12072v4 [math.ST] UPDATED)
    (3 min) We investigate multiple testing and variable selection using the Least Angle Regression (LARS) algorithm in high dimensions under the assumption of Gaussian noise. LARS is known to produce a piecewise-affine solution path whose change points are referred to as the knots of the LARS path. The key to our results is a closed-form expression for the exact joint law of a $K$-tuple of knots conditional on the variables selected by LARS, namely the so-called post-selection joint law of the LARS knots. Numerical experiments demonstrate a perfect fit with our findings. This paper makes three main contributions. First, we build testing procedures for variables entering the model along the LARS path in the general design case when the noise level can be unknown. These testing procedures are referred to as Generalized $t$-Spacing tests (GtSt), and we prove that they have an exact non-asymptotic level (i.e., the Type I error is exactly controlled). This extends the work of Taylor et al. (2014), where the spacing test applies to consecutive knots with known variance. Second, we introduce a new exact multiple false-negatives test after model selection in the general design case when the noise level may be unknown. We prove that this testing procedure has an exact non-asymptotic level for general designs and unknown noise level. Third, we give exact control of the false discovery rate under an orthogonal design assumption. Monte Carlo simulations and a real-data experiment illustrate our results in this case. Of independent interest, we introduce an equivalent formulation of the LARS algorithm based on a recursive function.
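    For readers who want to see the knots concretely, scikit-learn exposes the LARS path directly (a toy illustration on synthetic data, not the paper's testing procedure):

    ```python
    import numpy as np
    from sklearn.linear_model import lars_path

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 20))
    y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.standard_normal(100)

    # The LARS path is piecewise affine; the breakpoints ("knots") are the
    # alphas at which variables enter the active set.
    alphas, active, coefs = lars_path(X, y, method="lar")
    print("knots (alphas):", alphas)          # decreasing sequence of knots
    print("entry order of variables:", active)
    ```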
    Triangular Flows for Generative Modeling: Statistical Consistency, Smoothness Classes, and Fast Rates. (arXiv:2112.15595v1 [stat.ML])
    (2 min) Triangular flows, also known as Knöthe-Rosenblatt measure couplings, comprise an important building block of normalizing flow models for generative modeling and density estimation, including popular autoregressive flow models such as real-valued non-volume preserving transformation models (Real NVP). We present statistical guarantees and sample complexity bounds for triangular flow statistical models. In particular, we establish the statistical consistency and the finite sample convergence rates of the Kullback-Leibler estimator of the Knöthe-Rosenblatt measure coupling using tools from empirical process theory. Our results highlight the anisotropic geometry of function classes at play in triangular flows, shed light on optimal coordinate ordering, and lead to statistical guarantees for Jacobian flows. We conduct numerical experiments on synthetic data to illustrate the practical implications of our theoretical findings.
    Non Asymptotic Bounds for Optimization via Online Multiplicative Stochastic Gradient Descent. (arXiv:2112.07110v4 [stat.ML] UPDATED)
    (2 min) The gradient noise of Stochastic Gradient Descent (SGD) is considered to play a key role in its properties (e.g. escaping low-potential points and regularization). Past research has indicated that the covariance of the minibatch SGD error plays a critical role in determining its regularization and escape from low-potential points. However, how the distribution of the error influences the behavior of the algorithm remains little explored. Motivated by recent research in this area, we prove universality results showing that noise classes with the same mean and covariance structure as minibatch SGD have similar properties. We mainly consider the Multiplicative Stochastic Gradient Descent (M-SGD) algorithm introduced by Wu et al., which admits a much more general noise class than minibatch SGD. We establish non-asymptotic bounds for the M-SGD algorithm, mainly with respect to the Stochastic Differential Equation corresponding to minibatch SGD. We also show that the M-SGD error is approximately a scaled Gaussian with mean $0$ at any fixed point of the M-SGD algorithm, and we establish bounds for the convergence of M-SGD in the strongly convex regime.
    Sufficient Statistic Memory AMP. (arXiv:2112.15327v1 [cs.IT])
    (2 min) Approximate message passing (AMP) is a promising technique for unknown signal reconstruction in certain high-dimensional linear systems with non-Gaussian signaling. A distinguishing feature of AMP-type algorithms is that their dynamics can be rigorously described by state evolution. However, state evolution does not necessarily guarantee the convergence of iterative algorithms. To solve the convergence problem of AMP-type algorithms in principle, this paper proposes a memory AMP (MAMP) under a sufficient statistic condition, named sufficient statistic MAMP (SS-MAMP). We show that the covariance matrices of SS-MAMP are L-banded and convergent. Given an arbitrary MAMP, we can construct an SS-MAMP by damping, which not only ensures the convergence of MAMP but also preserves its orthogonality, i.e., its dynamics can be rigorously described by state evolution. As a byproduct, we prove that the Bayes-optimal orthogonal/vector AMP (BO-OAMP/VAMP) is an SS-MAMP. As a result, we reveal two interesting properties of BO-OAMP/VAMP for large systems: 1) its covariance matrices are L-banded and convergent, and 2) damping and memory are useless (i.e., they bring no performance improvement) in BO-OAMP/VAMP. As an example, we construct a sufficient statistic Bayes-optimal MAMP (BO-MAMP), which is Bayes-optimal if its state evolution has a unique fixed point, and whose MSE is no worse than that of the original BO-MAMP. Finally, simulations verify the validity and accuracy of the theoretical results.
    The SAMME.C2 algorithm for severely imbalanced multi-class classification. (arXiv:2112.14868v1 [stat.ML])
    (2 min) Classification predictive modeling involves the accurate assignment of observations in a dataset to target classes or categories. A growing number of real-world classification problems exhibit severely imbalanced class distributions, in which minority classes have far fewer observations to learn from than majority classes. Despite this sparsity, a minority class is often considered the more interesting class, yet developing a learning algorithm suitable for such observations presents countless challenges. In this article, we propose a novel multi-class classification algorithm, which we refer to as SAMME.C2, specialized for severely imbalanced classes. It blends the flexible boosting mechanics of the SAMME algorithm, a multi-class classifier, with those of the Ada.C2 algorithm, a cost-sensitive binary classifier designed to address high class imbalance. We provide the resulting algorithm together with a scientific and statistical formulation of it. Through numerical experiments examining various degrees of classifier difficulty, we demonstrate the consistently superior performance of our proposed model.
    How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?. (arXiv:1911.12360v4 [cs.LG] UPDATED)
    (2 min) A recent line of research on deep learning focuses on the extremely over-parameterized setting and shows that when the network width is larger than a high-degree polynomial of the training sample size $n$ and the inverse target error $\epsilon^{-1}$, deep neural networks learned by (stochastic) gradient descent enjoy nice optimization and generalization guarantees. Very recently, it was shown that under certain margin assumptions on the training data, a polylogarithmic width condition suffices for two-layer ReLU networks to converge and generalize (Ji and Telgarsky, 2019). However, whether deep neural networks can be learned with such mild over-parameterization had remained an open question. In this work, we answer it affirmatively and establish sharper learning guarantees for deep ReLU networks trained by (stochastic) gradient descent. Specifically, under certain assumptions made in previous work, our optimization and generalization guarantees hold with network width polylogarithmic in $n$ and $\epsilon^{-1}$. Our results push the study of over-parameterized deep neural networks towards more practical settings.
    Studying the Interplay between Information Loss and Operation Loss in Representations for Classification. (arXiv:2112.15238v1 [cs.LG])
    (2 min) Information-theoretic measures have been widely adopted in the design of features for learning and decision problems. Inspired by this, we look at the relationship between i) a weak form of information loss in the Shannon sense and ii) the operation loss in the minimum probability of error (MPE) sense when considering a family of lossy continuous representations (features) of a continuous observation. We present several results that shed light on this interplay. Our first result offers a lower bound on a weak form of information loss as a function of its respective operation loss when adopting a discrete lossy representation (quantization) instead of the original raw observation. From this, our main result shows that a specific form of vanishing information loss (a weak notion of asymptotic informational sufficiency) implies a vanishing MPE loss (or asymptotic operational sufficiency) when considering a general family of lossy continuous representations. Our theoretical findings support the observation that the selection of feature representations that attempt to capture informational sufficiency is appropriate for learning, but this selection is a rather conservative design principle if the intended goal is achieving MPE in classification. Supporting this last point, and under some structural conditions, we show that it is possible to adopt an alternative notion of informational sufficiency (strictly weaker than pure sufficiency in the mutual information sense) to achieve operational sufficiency in learning.
    Neural Thompson Sampling. (arXiv:2010.00827v2 [cs.LG] UPDATED)
    (2 min) Thompson Sampling (TS) is one of the most effective algorithms for solving contextual multi-armed bandit problems. In this paper, we propose a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation. At the core of our algorithm is a novel posterior distribution of the reward, where its mean is the neural network approximator, and its variance is built upon the neural tangent features of the corresponding neural network. We prove that, provided the underlying reward function is bounded, the proposed algorithm is guaranteed to achieve a cumulative regret of $\mathcal{O}(T^{1/2})$, which matches the regret of other contextual bandit algorithms in terms of total round number $T$. Experimental comparisons with other benchmark bandit algorithms on various data sets corroborate our theory.
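    To make the mechanism concrete, here is a heavily simplified, linear-feature stand-in for the Thompson-sampling loop (the paper builds the posterior on neural tangent features of a network; everything below is an illustrative assumption, not the paper's NeuralTS):

    ```python
    import numpy as np

    d, lam, nu = 8, 1.0, 0.1          # feature dim, ridge prior, exploration scale
    A = lam * np.eye(d)               # posterior precision over reward weights
    b = np.zeros(d)

    def select_arm(arm_features, rng):
        mu = np.linalg.solve(A, b)
        cov = nu ** 2 * np.linalg.inv(A)
        w = rng.multivariate_normal(mu, cov)       # sample from the posterior
        return int(np.argmax(arm_features @ w))    # act greedily on the sample

    def update(phi, reward):
        global A, b
        A += np.outer(phi, phi)                    # Bayesian linear-regression update
        b += reward * phi

    rng = np.random.default_rng(0)
    arms = rng.standard_normal((5, d))             # one feature vector per arm
    a = select_arm(arms, rng)
    update(arms[a], reward=1.0)
    ```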
    Optimal Model Averaging of Support Vector Machines in Diverging Model Spaces. (arXiv:2112.12961v2 [stat.ML] UPDATED)
    (2 min) The support vector machine (SVM) is a powerful classification method that has achieved great success in many fields. Since its performance can be seriously impaired by redundant covariates, model selection techniques are widely used for SVMs with high-dimensional covariates. As an alternative to model selection, significant progress has been made in model averaging over the past decades, yet no frequentist model averaging method has been considered for SVMs. This work fills that gap by proposing a frequentist model averaging procedure for SVMs that selects the optimal weights by cross-validation. Even when the number of covariates diverges at an exponential rate in the sample size, we show asymptotic optimality of the proposed method in the sense that the ratio of its hinge loss to the lowest possible loss converges to one. We also derive the convergence rate, which provides more insight into model averaging. Compared to SVM model selection methods, which require the tedious but critical task of tuning-parameter selection, the model averaging method avoids this task and shows promising performance in empirical studies.
    Hierarchical forecasting with a top-down alignment of independent level forecasts. (arXiv:2103.08250v4 [stat.ML] UPDATED)
    (2 min) Hierarchical forecasting with intermittent time series is a challenge in both research and empirical studies. Extensive research focuses on improving the accuracy of each level of the hierarchy, especially the intermittent time series at the bottom levels, after which hierarchical reconciliation can further improve overall performance. In this paper, we present a hierarchical-forecasting-with-alignment approach that treats the bottom-level forecasts as mutable to ensure higher forecasting accuracy at the upper levels of the hierarchy. We employ N-BEATS, a pure deep learning forecasting approach, for the continuous time series at the top levels, and LightGBM, a widely used tree-based algorithm, for the intermittent time series at the bottom level. The hierarchical-forecasting-with-alignment approach is a simple yet effective variant of the bottom-up method that accounts for biases that are difficult to observe at the bottom level: it allows suboptimal forecasts at the lower levels while retaining higher overall performance. The approach in this empirical study was developed by the first author during the M5 Forecasting Accuracy competition, where it ranked second. The method is also business-oriented and could benefit strategic business planning.
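    A minimal sketch of the alignment idea, under simplifying assumptions of ours (independently produced top- and bottom-level forecasts, proportional rescaling; the paper's actual alignment may differ):

    ```python
    import numpy as np

    # Forecasts over a 3-step horizon: the top level forecast directly (the
    # paper uses N-BEATS), the bottom level separately (the paper uses LightGBM).
    top_forecast = np.array([120.0, 130.0, 110.0])
    bottom_forecast = np.array([[30.0, 35.0, 25.0],    # item 1
                                [50.0, 55.0, 45.0],    # item 2
                                [30.0, 30.0, 30.0]])   # item 3

    # Rescale the (mutable) bottom-level forecasts so they aggregate exactly
    # to the top-level forecast at every horizon step.
    aligned = bottom_forecast * (top_forecast / bottom_forecast.sum(axis=0))
    assert np.allclose(aligned.sum(axis=0), top_forecast)
    ```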
    Separation of scales and a thermodynamic description of feature learning in some CNNs. (arXiv:2112.15383v1 [stat.ML])
    (2 min) Deep neural networks (DNNs) are powerful tools for compressing and distilling information. Due to their scale and complexity, often involving billions of inter-dependent internal degrees of freedom, exact analysis approaches often fall short. A common strategy in such cases is to identify slow degrees of freedom that average out the erratic behavior of the underlying fast microscopic variables. Here, we identify such a separation of scales occurring in over-parameterized deep convolutional neural networks (CNNs) at the end of training. It implies that neuron pre-activations fluctuate in a nearly Gaussian manner with a deterministic latent kernel. While for CNNs with infinitely many channels these kernels are inert, for finite CNNs they adapt and learn from data in an analytically tractable manner. The resulting thermodynamic theory of deep learning yields accurate predictions on several deep non-linear CNN toy models. In addition, it provides new ways of analyzing and understanding CNNs.
    Improved Algorithm for the Network Alignment Problem with Application to Binary Diffing. (arXiv:2112.15336v1 [cs.LG])
    (2 min) In this paper, we present a novel algorithm to address the network alignment problem. It is inspired by a previous message-passing framework of Bayati et al. [2] and includes several modifications designed to significantly speed up the message updates and to enforce their convergence. Experiments show that our proposed model outperforms other state-of-the-art solvers. Finally, we propose an application of our method to the binary diffing problem. We show that our solution provides better assignments than the reference differ in almost all submitted instances, and we highlight the importance of leveraging the graphical structure of binary programs.
    Bayesian Optimization of Function Networks. (arXiv:2112.15311v1 [cs.LG])
    (2 min) We consider Bayesian optimization of the output of a network of functions, where each function takes as input the output of its parent nodes, and where the network takes significant time to evaluate. Such problems arise, for example, in reinforcement learning, engineering design, and manufacturing. While the standard Bayesian optimization approach observes only the final output, our approach delivers greater query efficiency by leveraging information that the former ignores: intermediate output within the network. This is achieved by modeling the nodes of the network using Gaussian processes and choosing the points to evaluate using, as our acquisition function, the expected improvement computed with respect to the implied posterior on the objective. Although the non-Gaussian nature of this posterior prevents computing our acquisition function in closed form, we show that it can be efficiently maximized via sample average approximation. In addition, we prove that our method is asymptotically consistent, meaning that it finds a globally optimal solution as the number of evaluations grows to infinity, thus generalizing previously known convergence results for the expected improvement. Notably, this holds even though our method might not evaluate the domain densely, instead leveraging problem structure to leave regions unexplored. Finally, we show that our approach dramatically outperforms standard Bayesian optimization methods in several synthetic and real-world problems.
    A Unified and Constructive Framework for the Universality of Neural Networks. (arXiv:2112.14877v1 [cs.LG])
    (2 min) One of the reasons that many neural networks are capable of replicating complicated tasks or functions is their universality property. The past few decades have seen many attempts at constructive proofs for single networks or classes of neural networks. This paper provides a unified and constructive framework for the universality of a large class of activations, including most existing ones and beyond. At the heart of the framework is the concept of a neural network approximate identity. It turns out that most existing activations are neural network approximate identities, and thus universal in the space of continuous functions on compacta. The framework has several advantages. First, it is constructive, using elementary means from functional analysis, probability theory, and numerical analysis. Second, it is the first unified attempt that is valid for most existing activations. Third, as a by-product, it provides the first universality proof for several existing activation functions, including Mish, SiLU, ELU, and GELU. Fourth, it discovers new activations with a guaranteed universality property: indeed, any activation whose $k$th derivative, with $k$ an integer, is integrable and essentially bounded is universal. Fifth, for a given activation and error tolerance, the framework gives precisely the architecture of the corresponding one-hidden-layer neural network with a predetermined number of neurons, together with the values of the weights and biases.
    A sampling-based approach for efficient clustering in large datasets. (arXiv:2112.14793v1 [cs.LG])
    (2 min) We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters. Our algorithm achieves high performance by evaluating distances between datapoints and only a subset of the cluster centres. It is substantially more efficient than k-means, as it does not require an all-to-all comparison of data points and clusters. We show that the optimal solutions of our approximation are the same as in the exact solution, while our approach is considerably more efficient at extracting these clusters than the state of the art. We compare our approximation with exact k-means and alternative approximation approaches on a series of standardised clustering tasks. For the evaluation, we consider algorithmic complexity, including the number of operations to convergence, and the stability of the results.
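    An illustrative toy version of the idea of comparing each point against only a subset of centres (our own sketch, not the authors' exact algorithm):

    ```python
    import numpy as np

    def sampled_assign(X, centres, prev_assign, m, rng):
        """k-means-style assignment comparing each point to a random subset of
        m centres plus its previous centre, instead of all k centres."""
        n, k = X.shape[0], centres.shape[0]
        assign = prev_assign.copy()
        for i in range(n):
            cand = np.unique(np.append(rng.choice(k, size=m, replace=False),
                                       prev_assign[i]))
            d = np.linalg.norm(X[i] - centres[cand], axis=1)
            assign[i] = cand[np.argmin(d)]
        return assign

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 16))
    k = 50
    centres = X[rng.choice(len(X), size=k, replace=False)]
    assign = rng.integers(0, k, size=len(X))
    assign = sampled_assign(X, centres, assign, m=5, rng=rng)
    ```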
    Entropy Regularized Optimal Transport Independence Criterion. (arXiv:2112.15265v1 [stat.ML])
    (2 min) Optimal transport (OT) and its entropy-regularized offspring have recently gained a lot of attention in both the machine learning and AI communities. In particular, optimal transport has been used to develop metrics between probability distributions. In this paper we introduce an independence criterion based on entropy-regularized optimal transport. Our criterion can be used to test for independence between two samples. We establish non-asymptotic bounds for our test statistic and study its statistical behavior under both the null and the alternative hypothesis. Our theoretical results involve tools from U-process theory and optimal transport theory. We present experimental results on existing benchmarks, illustrating the interest of the proposed criterion.
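    A rough sketch of the underlying idea (not the paper's exact statistic): compare the entropic OT cost between the empirical joint sample and a surrogate sample from the product of marginals, here obtained by permuting one coordinate:

    ```python
    import numpy as np

    def sinkhorn_cost(x, y, eps=0.1, iters=200):
        """Entropic OT cost between two empirical point clouds via Sinkhorn."""
        C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2) ** 2
        K = np.exp(-C / eps)
        n, m = C.shape
        a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
        u, v = np.ones(n), np.ones(m)
        for _ in range(iters):            # Sinkhorn fixed-point iterations
            u = a / (K @ v)
            v = b / (K.T @ u)
        P = u[:, None] * K * v[None, :]   # entropic transport plan
        return float((P * C).sum())

    rng = np.random.default_rng(0)
    x = rng.standard_normal(300)
    y = 0.8 * x + 0.6 * rng.standard_normal(300)        # dependent pair
    joint = np.column_stack([x, y])
    indep = np.column_stack([x, rng.permutation(y)])    # break the dependence

    print("entropic OT statistic:", sinkhorn_cost(joint, indep))  # larger under dependence
    ```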
    Time varying regression with hidden linear dynamics. (arXiv:2112.14862v1 [math.ST])
    (2 min) We revisit a model for time-varying linear regression that assumes the unknown parameters evolve according to a linear dynamical system. Counterintuitively, we show that when the underlying dynamics are stable the parameters of this model can be estimated from data by combining just two ordinary least squares estimates. We offer a finite sample guarantee on the estimation error of our method and discuss certain advantages it has over Expectation-Maximization (EM), which is the main approach proposed by prior work.
  • Featured Blog Posts - Data Science Central

    Data Monetization Approach for B2B2C Industries
    (6 min) Most of the Big Data, Data Science and AI / ML success stories seem to revolve around Business-to-Consumer (B2C) industries.  The companies that we see leading the economic exploitation of data and analytics are primarily B2C companies such as Apple, Alphabet (Google), Microsoft, Amazon, and Facebook (Figure 1).…
    IoT Drives Growth of Intelligent Transportation Systems
    (3 min) The global IoT in intelligent transportation systems market is anticipated to be driven by the rising focus of industry players on research and development, aimed at enhancing the integrated IoT software in order to minimize the cost of operation associated with these tools. Furthermore, the increasing focus on data collection pertaining to…
  • AWS Machine Learning Blog

    Identity verification using Amazon Rekognition
    (9 min) In-person user identity verification is slow to scale, costly, and high friction for users. Machine learning (ML) powered facial recognition technology can enable online user identity verification. Amazon Rekognition offers pre-trained facial recognition capabilities that you can quickly add to your user onboarding and authentication workflows to verify opted-in users’ identities online. No ML expertise […]
  • Jay Alammar

    The Illustrated Retrieval Transformer
    (5 min) Discussion: see the discussion thread for comments, corrections, or any feedback. Summary: The latest batch of language models can be much smaller yet achieve GPT-3-like performance by being able to query a database or search the web for information. A key implication is that building larger and larger models is not the only way to improve performance. The last few years saw the rise of Large Language Models (LLMs) – machine learning models that rapidly improve how machines process and generate language. Some of the highlights since 2017: the original Transformer breaks previous performance records for machine translation; BERT popularizes the pre-training-then-finetuning process, as well as Transformer-based contextualized word embeddings, and rapidly starts to power Google Search and Bing Search; GPT-2 demonstrates the machine's ability to write as well as humans do; first T5, then T0 push the boundaries of transfer learning (training a model on one task and then having it do well on other, adjacent tasks) by posing many different tasks as text-to-text tasks; GPT-3 shows that massive scaling of generative models can lead to shocking emergent applications (and the industry continues to train larger models like Gopher, MT-NLG, etc.). For a while, it seemed like scaling larger and larger models was the main way to improve performance. Recent developments in the field, like DeepMind's RETRO Transformer and OpenAI's WebGPT, reverse this trend by showing that smaller generative language models can perform on par with massive models if we augment them with a way to search or query for information. This article breaks down DeepMind's RETRO (Retrieval-Enhanced TRansfOrmer) and how it works. The model performs on par with GPT-3 despite being about 4% its size (7.5 billion parameters vs. 175 billion for GPT-3 Da Vinci). RETRO incorporates information retrieved from a database to free its parameters from being an expensive store of facts and world knowledge. RETRO was presented in the paper Improving Language Models by Retrieving from Trillions of Tokens, and it continues and builds on a wide variety of retrieval work in the research community. This article explains the model, not what is especially novel about it.
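    The retrieval step RETRO depends on can be pictured with a toy nearest-neighbour lookup (a sketch under assumed shapes; RETRO itself uses frozen BERT embeddings over a trillion-token database searched with approximate nearest neighbours):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    db_chunks = ["chunk %d text ..." % i for i in range(1000)]   # toy database
    db_emb = rng.standard_normal((1000, 128))
    db_emb /= np.linalg.norm(db_emb, axis=1, keepdims=True)      # unit-norm rows

    def retrieve(query_emb, k=2):
        """Return the k database chunks most similar to the query embedding."""
        q = query_emb / np.linalg.norm(query_emb)
        scores = db_emb @ q                  # cosine similarity against all chunks
        top = np.argsort(-scores)[:k]
        return [db_chunks[i] for i in top]

    neighbours = retrieve(rng.standard_normal(128))
    # In RETRO, retrieved neighbours are attended to by the decoder through
    # chunked cross-attention, rather than simply prepended to the prompt.
    ```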
  • Artificial Intelligence

    Amazon Research Introduces Deep Reinforcement Learning For NLU Ranking Tasks
    (1 min) In recent years, voice-based virtual assistants such as Google Assistant and Amazon Alexa have grown popular. This has presented both opportunities and challenges for natural language understanding (NLU) systems. The production systems behind these devices are often trained by supervised learning and rely heavily on annotated data. But data annotation is costly and time-consuming, and model updates via offline supervised learning can take long and miss trending requests. In the underlying architecture of voice-based virtual assistants, the NLU model typically categorizes user requests into hypotheses for downstream applications to fulfill. A hypothesis comprises two tags: the user intention (intent) and Named Entity Recognition (NER). For example, a valid hypothesis for "play a Madonna song" would be: PlaySong intent, ArtistName – Madonna. New Amazon research introduces deep reinforcement learning strategies for NLU ranking. The work analyses a ranking problem in an NLU system in which entirely independent domain experts generate hypotheses with their own features, where a domain is a functional area such as Music, Shopping, or Weather. These hypotheses are then ranked by scores computed from their features, so the ranker must calibrate features from the domain experts and select one hypothesis according to its policy. Research Paper: https://assets.amazon.science/b3/74/77ff47044b69820c466f0624a0ab/introducing-deep-reinforcement-learning-to-nlu-ranking-tasks.pdf submitted by /u/ai-lover
    AI Will Make It Difficult to Discern Information from Misinformation (1-minute audio clip from Eric Schmidt, former Google CEO)
    (1 min) submitted by /u/frog9913
    NLP: Hybridization of statistical approach and expert system?
    (1 min) Hi everyone! I have a question for you. For context, we aggregate the various AI APIs on the market (GCP, Azure, etc.) on one platform, including NLP APIs (keyword extraction, sentiment analysis, NER, etc.). The idea is that a developer doesn't have to create accounts with different providers and can access them all through one API to test, compare, and switch whenever they want. However, many customers ask us how to mix the "statistical" approach behind these APIs with expert systems and how to achieve this hybridization. Do you have any idea how to do this? Thanks! submitted by /u/tah_zem
    Top 10 Object Detection APIs
    (0 min) submitted by /u/tah_zem
    Open Domain Question Answering Part-1 [BlenderBot 2.0]
    (0 min) submitted by /u/coffeeroach
  • Machine Learning

    [R] 🐸YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
    (1 min) YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS... it is possible to fine-tune the model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity and with reasonable quality. 🤖 Demo: https://coqui.ai 👩‍💻 Code: https://github.com/coqui-ai/tts 🚀 Blogpost: https://coqui.ai/blog/tts/yourtts-zero-shot-text-synthesis-low-resource-languages 📎 Paper: https://arxiv.org/abs/2112.02418 submitted by /u/josh-r-meyer
    [D] "why academia tends to under-invest in engineering infrastructure?"
    (1 min) Tweet from @jackclarkSF asks an interesting question: Is there a good paper that explains how/why academia tends to under-invest in engineering infrastructure? https://twitter.com/jackclarkSF/status/1478077579110207489 submitted by /u/MassivePellfish
    [D] How to measure accuracy of kNN Imputation?
    (1 min) I have a dataset in which the best way to impute missing values is kNN, but before I go ahead and do that I'd like to check what kind of accuracy that form of imputation achieves on this specific dataset, and which k should be used. My original plan was as follows: from my original dataset, remove all rows with missing values; from this complete dataset, insert NaNs randomly with the same frequency as the original missingness, storing the replaced values in a new dataset as the ground truth; impute using kNN; check the accuracy of the imputed values against the stored ground truth using MAE, for different k values. Is there an easier way to do this? If not, should I be using MAE or another accuracy score? submitted by /u/Ok-Culture-9123
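    The loop described in the post is straightforward to script with scikit-learn's KNNImputer; a sketch under assumed data shapes:

    ```python
    import numpy as np
    from sklearn.impute import KNNImputer

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 8))        # stand-in for the complete rows
    miss_rate = 0.1                          # observed missingness frequency

    mask = rng.random(X.shape) < miss_rate   # inject NaNs at the same rate
    X_missing = X.copy()
    X_missing[mask] = np.nan

    # Impute for several k and score against the held-out ground truth with MAE.
    for k in (1, 3, 5, 10):
        imputed = KNNImputer(n_neighbors=k).fit_transform(X_missing)
        mae = np.abs(imputed[mask] - X[mask]).mean()
        print(f"k={k}: MAE={mae:.3f}")
    ```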
    [D] Paper that mathematically proves that gradient descent can achieve zero training error.
    (2 min) I think this is a well-known paper, but I have not been able to find it. I am interested in the paper that mathematically proves neural nets can fit any set of datapoints. So far what I have mostly found is papers that show this, or something related, empirically, like this one. I'd appreciate any help. Edit: u/the_new_scientist shared the paper I was looking for: https://arxiv.org/pdf/1810.02054.pdf Also, I apologize for my vague description. Now that the paper has been found, I hope it is clearer to future readers what kind of results I meant, but in case it is not, I was wondering about this question: under what conditions can a neural network achieve zero training error? In particular, I am interested in papers with mathematical (even without empirical) results. submitted by /u/carlml
    [P] Bringing serverless to ML - stateful, arbitrary dependency, serverless for ML
    (1 min) Serverless infrastructure is not yet practical for ML, but we think it could bring lots of benefits. So, with a friend, we decided to make serverless easy for ML, and we are building a platform to solve the main issues we find in serverless for ML: Stateful: we don't want to reload a whole model every time a user calls model.predict. Arbitrary dependencies: normal Python code with any package dependencies you use and love, just many, many times in parallel. Scale-up and scale-down: scale up with ease and auto-shutdown to keep resource consumption down. You can visit our webpage, try the demo, and request early access to use our platform! Webpage: https://telekinesis.cloud Happy to receive questions and comments on what we are building! submitted by /u/snuns90
    [D] What are your hopes for Machine Learning in 2022?
    (1 min) Hi r/MachineLearning! I was just wondering what some of you are hoping ML can accomplish or overcome in this new year - interested in hearing your thoughts! submitted by /u/DataGeek0
    [R] The Illustrated Retrieval Transformer (GPT3 performance at 4% the size)
    (4 min) Hi r/MachineLearning, I spent some time wrapping my head around DeepMind's Retro Transformer and visualizing how it works. Hope you find it useful. All feedback is welcome! http://jalammar.github.io/illustrated-retrieval-transformer/ submitted by /u/jayalammar
    [D] What causes feature collapse?
    (1 min) For those of you unfamiliar, feature collapse is when you train a model for classification and it ends up mapping out-of-distribution data, or data of different classes, in very close proximity in the feature space. For example, once your model learns a cluster, so to speak, for "cat", at test time it projects a dog into the center of that cluster and classifies it as a cat. Some ways to deal with this in CV are a double gradient penalty and the spectral norm of ResNet blocks, but what causes feature collapse? submitted by /u/DolantheMFWizard
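    For reference, one of the mitigations mentioned (constraining the spectral norm of layers) is available out of the box in PyTorch; a minimal sketch:

    ```python
    import torch
    import torch.nn as nn

    # Spectral normalization rescales the layer's weight by its largest singular
    # value, bounding the layer's Lipschitz constant.
    conv = nn.utils.spectral_norm(nn.Conv2d(3, 16, kernel_size=3, padding=1))
    x = torch.randn(4, 3, 32, 32)
    out = conv(x)
    print(out.shape)   # torch.Size([4, 16, 32, 32])
    ```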
    [R] New paper: "A relational Tsetlin machine with applications to natural language understanding"
    (2 min) Relational Tsetlin Machine: the paper introduces the first relational Tsetlin machine, which reasons with relations, variables, and constants. The approach is based on first-order logic and Herbrand semantics, taking the first steps toward the computing power of a universal Turing machine. It can take advantage of logical structures appearing in natural language to learn rules that represent how actions and consequences are related in the real world. The outcome is a logic program of Horn clauses, bringing a structured view to unstructured data. In closed-domain question answering, the first-order representation produces 10x more compact knowledge bases, along with an increase in answering accuracy from 94.83% to 99.48%. The approach is further robust towards erroneous, missing, and superfluous information, distilling the aspects of a text that are important for real-world understanding. https://link.springer.com/article/10.1007/s10844-021-00682-5 submitted by /u/olegranmo
    [D] How to deal with huge Categorical data
    (2 min) I have a dataset that contains about 55 columns, and around 10 of them hold categorical data. If I were to one-hot encode them, I would end up with a column count of more than 300. Is this advisable? How do you deal with such a huge number of columns? I mean, 300 columns is not a big deal, but I would like to know your opinions and thoughts on this. submitted by /u/CaterpillarPrevious2
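    One common alternative to one-hot encoding for high-cardinality columns is smoothed target (mean) encoding; a minimal pandas sketch (fit it on training folds only to avoid target leakage):

    ```python
    import pandas as pd

    def target_encode(df, col, target, smoothing=10.0):
        """Replace a category by a smoothed blend of its mean target and the prior."""
        prior = df[target].mean()
        stats = df.groupby(col)[target].agg(["mean", "count"])
        weight = stats["count"] / (stats["count"] + smoothing)
        enc = weight * stats["mean"] + (1 - weight) * prior
        return df[col].map(enc)

    df = pd.DataFrame({"city": ["a", "a", "b", "c", "b", "a"],
                       "label": [1, 0, 1, 1, 0, 1]})
    df["city_enc"] = target_encode(df, "city", "label")  # one numeric column, not 300
    print(df)
    ```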
    [D] Spotted this post in LessWrong. Can anyone verify the rather fantastic claims being made here?
    (4 min) https://www.lesswrong.com/posts/rCP5iTYLtfcoC8NXd/self-organised-neural-networks-a-simple-natural-and#Roadmap The writing has some red flags, but it looks interesting enough. Having some trouble with my GPU drivers, so I can't run it right now. submitted by /u/Their_bad_spellers
    [R] A Neural Network Solves and Generates Mathematics Problems by Program Synthesis: Calculus, Differential Equations, Linear Algebra, and More
    (0 min) submitted by /u/shitboots
    [D] Is there flow-based method which treats input data as different lengths each?
    (1 min) Hello. I am searching for research in which data of different sizes are generated through a flow-based network, as opposed to continuous-mapping tasks such as super-resolution. I want to generate output as time-aligned scalar data, for example: input: noise sampling (B x T x C); output: scalar data (B x T, C=1), using the variational data augmentation technique from VFlow (which can output high dimensionality by concatenating a noise vector to both input and output). But there is a problem: the time dimension T differs across data inputs. How can I handle this? P.S. I would also appreciate pointers to flow-based research on NLP tasks. submitted by /u/RedCuraceo
    [D] Anyone switched from vision to robotics?
    (1 min) I’m about to finish my PhD and the whole field of robotics looks so exciting right now, especially applications like farming and recycling. Has anyone switched from more pure deep learning (vision / NLP) to robotics and how did it happen? Did you just get a robotics related job focusing on the vision side of things or is it key to have more experience on the robotics side before getting a job? Also I’m curious what’s the best location for robotics? Like how you go to Hong Kong / New York for finance, SF for software or Shenzhen for hardware. submitted by /u/temporary_ml_guy
    [P] I like YOLOv5 but the code complexity is...
    (2 min) I can't deny that YOLOv5 is a practical open-source object detection pipeline. However, the pain begins when adding new features or experimental methods. Code dependencies are hard to follow, which makes the code difficult to maintain. We wanted to try various experimental methods but hated writing one-time code that is never reused. So we worked on an object detection pipeline with a better code structure, so that we can continuously improve it and add new features while keeping it easy to maintain. https://github.com/j-marple-dev/AYolov2 We applied CI (formatting, linting, unit tests) to ensure code quality, with Docker support for development and inference. Our Docker supports a development environment with VIM. Our code design from the…
    [D] GUI-based Machine Learning applications?
    (1 min) I was previously using Azure Machine Learning Studio (classic), and of course it was discontinued last month. Are there any other free ML applications? The new Azure Machine Learning Studio isn't free, and this is a school project, so I'm aiming for free and simple. Any suggestions? Or maybe someone else is using Studio (classic) and knows a way around this? submitted by /u/max02c
  • Neural Networks, Deep Learning and Machine Learning

    Interested in learning but not really sure where to start
    (1 min) For some time now I've been reading about neural networks and AI; the topic really interests me, and I have seen amazing GAN projects like Artbreeder. This has motivated me to want to work with this kind of technology. The only problem is that I can't really figure out where to start, so the only thing I've done so far is a Python introductory course. What should or could I do next? P.S.: Spanish is my native language, so excuse me if my English is a little rusty. submitted by /u/luisaalberto
  • Reinforcement Learning

    The Best Machine Learning Courses on Udemy (2022)
    (0 min) submitted by /u/JanPrince002
    Average reward formulation for continuing settings
    (2 min) I have a problem that is, by its nature, continuing: there are no episodes, and the agent has to operate for an infinite time. In my simulator, however, I can simulate episodes of any length. Is the average-reward formulation appropriate for such a problem? I do not know much about it and have a few questions for you: Could you point me to literature that addresses average reward and its pros and cons versus discounted reward for continuing problems? Is the average taken over a window of finite size, or in the limit as the window size goes to infinity? How do common RL algorithms work in this formulation; can we still adapt them to use average reward? submitted by /u/fedetask
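    For reference, the standard textbook definitions (Sutton and Barto, 2nd ed., Ch. 10) take the limit rather than a finite window, and replace the discounted return with a differential return:

    ```latex
    % Average-reward objective and differential return:
    \[
      r(\pi) \;=\; \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T}
                   \mathbb{E}\!\left[ R_t \mid S_0,\, A_{0:t-1} \sim \pi \right],
    \qquad
      G_t \;=\; \sum_{k=1}^{\infty} \bigl( R_{t+k} - r(\pi) \bigr).
    \]
    ```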
    Formula to compute loss in A3C
    (1 min) I'm a beginner in RL and I'm trying to understand how the loss function is computed, and whether it follows a specific formula. I've read the A3C algorithm overview in the paper by Barto, but the implementation here https://github.com/MorvanZhou/pytorch-A3C/blob/master/discrete_A3C.py seems to be different. submitted by /u/phissy08
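    For reference, the A3C objective in Mnih et al. combines a policy-gradient term weighted by the advantage, a value-regression term, and an entropy bonus; a PyTorch sketch (the coefficients are conventional choices, not prescribed values):

    ```python
    import torch

    def a3c_loss(logits, values, actions, returns, entropy_coef=0.01):
        dist = torch.distributions.Categorical(logits=logits)
        advantage = returns - values.squeeze(-1)            # A_t = R_t - V(s_t)
        # Policy gradient: the advantage is detached so it only weights the term.
        policy_loss = -(dist.log_prob(actions) * advantage.detach()).mean()
        value_loss = advantage.pow(2).mean()                # (R_t - V(s_t))^2
        entropy = dist.entropy().mean()                     # encourages exploration
        return policy_loss + 0.5 * value_loss - entropy_coef * entropy

    logits = torch.randn(32, 4, requires_grad=True)         # toy batch
    values = torch.randn(32, 1, requires_grad=True)
    actions = torch.randint(0, 4, (32,))
    returns = torch.randn(32)                               # n-step returns
    a3c_loss(logits, values, actions, returns).backward()
    ```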
    Introduction to Markov Decision Processes - Draft Chapters 2 and 3
    (1 min) submitted by /u/YouAgainShmidhoobuh
    DQN with online learning
    (2 min) I saw another post on here that mentioned how DQNs can be implemented via online learning, offline learning (e.g. batching), or a combination (offline learning first and then online learning in production). What I'm confused about is the actual deployment paradigm that facilitates online learning (either as the first step or in the hybrid). Is the full training code (with a checkpoint model, if hybrid) deployed to production alongside the agent, with the training-step logic executed on demand for each state transition? If so, does this mean that model compilation/quantization is not possible (e.g. ONNX Runtime at INT8 precision)? How does it scale if it has to perform SGD and back-propagation on each state transition? On the other side, if the model is compiled/quantized, how does it implement online learning, if that is even possible? submitted by /u/FinateAI
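    For concreteness, the per-transition ("online") pattern would keep the optimizer and target network alive in production and run one SGD step per transition, which a frozen, quantized inference-only artifact generally could not do; a minimal sketch under assumed dimensions:

    ```python
    import torch
    import torch.nn as nn

    obs_dim, n_actions, gamma = 4, 2, 0.99
    q = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    q_target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    q_target.load_state_dict(q.state_dict())        # periodic sync in practice
    opt = torch.optim.Adam(q.parameters(), lr=1e-3)

    def online_step(s, a, r, s_next, done):
        """One TD update, executed per observed transition."""
        with torch.no_grad():
            target = r + gamma * (1 - done) * q_target(s_next).max()
        loss = (q(s)[a] - target).pow(2)
        opt.zero_grad()
        loss.backward()
        opt.step()

    online_step(torch.randn(obs_dim), 1, 0.5, torch.randn(obs_dim), 0.0)
    ```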
    Any good papers for non-stationary environment?
    (1 min) I want to create a control system for agents (players). The players' movement will be non-stationary, so are there any good techniques or papers for training such a control system? Many thanks. submitted by /u/Spiritual_Fig3632
    How to change MDP into POMDP? (getting pixel observations)
    (1 min) Hi, I'm looking into OpenAI robogym (https://github.com/openai/robogym)'s rearrange environments, and want to try RL-based control on them with pixel observations. The default observations are not pixels, and I can't find any documentation online on how to switch these environments to pixel observations. Can anyone point me to resources where similar changes have been made? Are there pre-existing wrappers or code for similar MuJoCo environments anywhere? My other option is to simply render the environment with mode rgb_array and then train the algorithm on these rendered observations; is this a viable option? submitted by /u/junkQn
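    The render-based option mentioned in the post can be wrapped like this (whether robogym's rearrange environments support render(mode="rgb_array") is an assumption to verify against its docs):

    ```python
    import gym
    import numpy as np

    class PixelWrapper(gym.ObservationWrapper):
        """Replace the state observation with a rendered RGB frame."""

        def __init__(self, env, height=84, width=84):
            super().__init__(env)
            self.height, self.width = height, width
            self.observation_space = gym.spaces.Box(
                low=0, high=255, shape=(height, width, 3), dtype=np.uint8)

        def observation(self, obs):
            frame = self.env.render(mode="rgb_array")
            # Resize/crop the frame to (height, width) here so it matches
            # the declared observation_space before returning it.
            return frame
    ```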
    Discrete Actions (OpenAI Hide & Seek)
    (1 min) Hello :), happy new year to all of you! I have a question regarding the action space in the Hide & Seek paper by OpenAI, whose learning architecture (shown in the paper) ends with three action heads. As per the authors, the movement action for an agent "sets a discretized force along their x and y axis and torque around their z-axis". How are these three forces produced by the "movement action head", given that they are discretized? Does the movement action head have three outputs, each corresponding to an axis? So far, to my humble knowledge, I always assumed that a network with a discrete action space gives only one action per head (which could be the result of a softmax over the action space). Any insight? After some digging, I found that the movement action has a "MultiDiscrete" type, which is a gym class. I am struggling to understand the nature of the actions produced by this class; any idea? submitted by /u/AhmedNizam_
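    MultiDiscrete is simply a tuple of independent discrete sub-actions, and a common way to produce one is a separate categorical head per sub-action; a sketch with hypothetical bin counts:

    ```python
    import torch
    from gym.spaces import MultiDiscrete

    space = MultiDiscrete([11, 11, 11])   # force_x, force_y, torque_z bins (assumed)
    logits = torch.randn(3, 11)           # one row of logits per sub-action head

    # Sample each sub-action independently; the joint log-prob is the sum.
    dists = [torch.distributions.Categorical(logits=l) for l in logits]
    action = torch.stack([d.sample() for d in dists])   # e.g. tensor([4, 9, 0])
    log_prob = sum(d.log_prob(a) for d, a in zip(dists, action))

    assert space.contains(action.numpy())
    ```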

2022-01-02

  • Featured Blog Posts - Data Science Central

    Artificial Intelligence for Mental Health
    (4 min) Happy new year! This year, let's start with an issue that gained much prominence over the last two years: mental health. A variety of AI techniques and strategies have been employed in mental health. In this post, we summarise the findings of a paper (link below). Summary: The main predictor…
  • Artificial Intelligence

    I built an AI Discord bot that bans NFT Bros from my server
    (0 min) submitted by /u/TernaryJimbo [link] [comments]
    What can AI ‘discover’ in data that I was not expecting?
    (2 min) Help me understand this, please. I've read that people feed data to an AI/ML algorithm and find things they weren't looking for, but there aren't many good examples out there to read about. So I'm thinking: if I were to feed an AI/ML algorithm a CSV of order details and customer data from an eCommerce system, should I expect the AI engine to "find" something statistically unusual and tell me about it? Or do I need to instruct it to look for certain traits of e-commerce orders and customers that I already know about, e.g. look out for fraud by checking x, y, z? What if the thing to look for is so strange that you need the AI to pick it up? E.g. maybe fraudsters start telling their friends to always use a certain name or phone number as an inside joke. I would think an AI might pick up on that somehow, whereas it might take a human much longer to figure it out. submitted by /u/rich_atl
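    A concrete version of the "find things I wasn't looking for" idea is unsupervised anomaly detection, which flags statistically unusual rows without being told what to look for; a sketch with hypothetical column names:

    ```python
    import pandas as pd
    from sklearn.ensemble import IsolationForest

    orders = pd.DataFrame({
        "order_value":      [20, 25, 22, 900, 24, 21],
        "items":            [1, 2, 1, 40, 2, 1],
        "account_age_days": [400, 350, 500, 1, 420, 380],
    })

    # No labels and no fraud rules: the model isolates rows that look unusual.
    model = IsolationForest(contamination=0.1, random_state=0).fit(orders)
    orders["anomaly"] = model.predict(orders)   # -1 flags the unusual order
    print(orders)
    ```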
    Software 3.0: Prompt programming
    (0 min) submitted by /u/Respawne
    Russian Bioinformaticians Have Created A Neural Network Architecture That Can Evaluate How Well An RNA Guide Has Been Chosen For Gene Editing
    (1 min) Genomic editing, particularly the CRISPR/Cas technique, is widely employed in experimental biology, agriculture, and biotechnology. CRISPR/Cas is one of several weapons used by bacteria to resist viruses. As the pathogen's DNA enters the cell, Cas proteins detect it as foreign hereditary material and break it, because its sequences differ from those of the bacterium. To respond to the virus more quickly, the bacterium saves pieces of the pathogen's DNA, much like a computer antivirus retains a collection of viral signatures, and passes them on to subsequent generations so that their Cas proteins can prevent future attacks. Teams from different laboratories independently adapted the CRISPR/Cas system to introduce arbitrary changes into DNA sequences in human and animal cells, which made genomic editing much easier and more efficient. The critical components of the mechanism are the guide RNA, which "marks the site," and the Cas9 protein, which cleaves DNA at that location. The cell subsequently "heals the wound," but the genetic code has already been altered. Quick Reading: https://www.marktechpost.com/2022/01/02/russian-bioinformaticians-have-created-a-neural-network-architecture-that-can-evaluate-how-well-an-rna-guide-has-been-chosen-for-gene-editing/ Paper: https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab1065/6430490 submitted by /u/ai-lover
    Tang Jie, the Tsinghua University professor leading the Wu Dao project, said in a recent interview that the group built 100 TRILLION parameter model in June, though it has not trained it to “convergence,” the point at which the model stops improving
    (1 min) submitted by /u/No-Transition-6630
    The top 10 AI/Computer Vision papers in 2021 with video demos, articles, and code for each!
    (0 min) submitted by /u/OnlyProggingForFun
    What AIs are there which are able to edit images?
    (1 min) My dad loves technology but does not really understand it. I thought I'd take some pictures of him and have an AI edit them so that he looks like a Simpsons character, etc. submitted by /u/xXLisa28Xx
  • Machine Learning

    [R] A Comparison Of The Program Synthesis Performance Of GitHub Copilot And Genetic Programming
    (0 min) submitted by /u/EducationalCicada
    [D] Machine Learning - WAYR (What Are You Reading) - Week 128
    (1 min) This is a place to share machine learning research papers, journals, and articles that you're reading this week. If it relates to what you're researching, by all means elaborate and give us your insight; otherwise it could just be an interesting paper you've read. Please try to provide some insight from your understanding, and please don't post things which are present in the wiki. Preferably you should link the arXiv page (not the PDF; you can easily access the PDF from the summary page but not the other way around) or any other pertinent links. submitted by /u/AutoModerator
    [D] ICLR 2022 Open Discussion Quality
    (1 min) First time submitting to ICLR; I'm wondering how the open discussion on openreview.net really differs from the review-rebuttal procedure used at other conferences. For the papers I'm reviewing, about half the reviewers reacted to the authors' responses (clarifications, modifications, additional experiments, etc.). As for my submission (5568), I ran extra experiments and answered each of the concerns directly, but received zero feedback from the reviewers. As a reviewer, I think it doesn't matter if the rebuttal changes your mind about the quality of the submission, but it's basic manners to reply to the authors' responses. A simple "Thanks to the authors for the responses, but I don't think they addressed my concerns" would work. Saying nothing only means you are uncertain whether the responses make sense and don't care to figure it out. The authors spent a whole week running experiments to answer some of your questions, and if you don't give **** about their responses, just keep the questions to yourself and don't submit your review. I was hoping for a different experience submitting to ICLR, and then I realized the "discussion" is basically broken. submitted by /u/MLPRulesAll
    [P] Tensorflow / Keras implementation of Vision Transformer https://arxiv.org/abs/2010.11929v2
    (0 min) An Image is Worth 16x16 Words: ViT achieves excellent results compared to SOTA CNNs while requiring fewer computational resources to train. Paper: https://arxiv.org/abs/2010.11929v2 Code: https://github.com/avinash31d/paper-implementations submitted by /u/avinash31d
    [D] Simple Questions Thread
    (3 min) Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator
    [R] Learning 3D Representations from 2D Images
    (2 min) submitted by /u/pinter69
    [D] Paper Explained & Author Interview - Player of Games: All the games, one algorithm! (Video Walkthrough)
    (1 min) https://youtu.be/U0mxx7AoNz0 Special Guest: First author Martin Schmid (https://twitter.com/Lifrordi). Games have been used throughout research as testbeds for AI algorithms, such as reinforcement learning agents. However, different types of games usually require different solution approaches, such as AlphaZero for Go or Chess, and Counterfactual Regret Minimization (CFR) for Poker. Player of Games bridges this gap between perfect and imperfect information games and delivers a single algorithm that uses tree search over public information states, and is trained via self-play. The resulting algorithm can play Go, Chess, Poker, Scotland Yard, and many more games, as well as non-game environments. OUTLINE: 0:00 - Introduction 2:50 - What games can Player of Games be trained on? 4:00 - Tree search algorithms (AlphaZero) 8:00 - What is different in imperfect information games? 15:40 - Counterfactual Value- and Policy-Networks 18:50 - The Player of Games search procedure 28:30 - How to train the network? 34:40 - Experimental Results 47:20 - Discussion & Outlook Paper: https://arxiv.org/abs/2112.03178 submitted by /u/ykilcher
    [Research][Project] The top 10 AI/Computer Vision papers in 2021 with video demos, articles, and code for each!
    (0 min) submitted by /u/OnlyProggingForFun
    [D] Machine Learning Research
    (0 min) Hi everyone, I've compiled trusted sources of ideation based on top-tier Machine Learning and Deep Learning conferences worldwide. The repository includes datasets, tasks, state-of-the-art results, and more. Repository: GitHub submitted by /u/tuanlda78202
    [P] Quick-Deploy - Optimize, convert and deploy machine learning models
    (0 min) Hello Reddit, releasing one of my OSS projects: Quick-Deploy. GitHub: https://github.com/rodrigobaron/quick-deploy Blog post: https://rodrigobaron.com/posts/quick-deploy It's in the very early stages, feel free to contribute or give a star 🙂 submitted by /u/rodrigobaron [link] [comments]
    [D] Raising errors while using accelerators
    (1 min) Why is it so hard to raise exceptions when an ML pipeline is using a GPU? When, for example, you make a classic "index out of bounds" error, in libraries like PyTorch you get a generic "CUDA" error, and you can't see the exact error until you transfer the tensors explicitly to the CPU and rerun the code. Do you think there is a possibility for this to improve in the future? Sorry if this is a more CS-related question. submitted by /u/Icy_Fisherman7187 [link] [comments]
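    For context: CUDA kernels run asynchronously, so the Python line that raises is often not the line that failed, which is why the error looks generic. Two standard workarounds, as a sketch (the out-of-range index is deliberate and illustrative):

        import os
        # 1) Force synchronous kernel launches *before* CUDA is initialized,
        #    so the stack trace points at the actual failing op.
        os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

        import torch

        emb = torch.nn.Embedding(10, 4)
        idx = torch.tensor([3, 42])  # 42 is out of bounds on purpose

        # 2) Rerun the failing step on CPU, where indexing errors raise
        #    immediately with a readable message.
        try:
            emb(idx)
        except IndexError as e:
            print(e)  # "index out of range in self"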
    [D] Are NN actually overparametrized?
    (3 min) I often read that NNs or CNNs are overparametrized. But, for example, resnet18 has 11M parameters while CIFAR-10 has 50k × 32 × 32 × 3 ≈ 153M data values. How is that an overparametrized network on CIFAR-10? Or even on MNIST, which has 60k × 28 × 28 ≈ 47M data values? submitted by /u/alesaso2000 [link] [comments]
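    A quick sanity check of the numbers in the post (a sketch; requires torchvision):

        import torch
        from torchvision.models import resnet18

        model = resnet18(num_classes=10)
        n_params = sum(p.numel() for p in model.parameters())
        print(f"resnet18 parameters: {n_params / 1e6:.1f}M")  # ~11.2M

        n_values = 50_000 * 32 * 32 * 3  # CIFAR-10 training pixel values
        print(f"CIFAR-10 input values: {n_values / 1e6:.1f}M")  # 153.6M

        # "Overparametrized" is usually judged against the number of training
        # *labels* (50k), not raw pixel values, hence ~224 parameters per label:
        print(n_params / 50_000)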
    [D] Coding Practices
    (7 min) My job is to work with ML engineers and provide them with whatever they need to experiment with/train/test/deploy ML models -- GPU infrastructure, distributed training support, etc. When I interface with their code, I almost always find it so poorly written, with little to no thought given to long-term stability or use -- for code that they 100% know is going to production. They're brilliant people, far smarter than me, and really good at what they do, so it's not a matter of them not being good enough. I feel (from my very limited experience, so I'm happy to be wrong) like ML engineers are incentivized to write poor code. The only metric for evaluation seems to be accuracy, loss, and all the plots that come up. In research, I understand completely, that's where the focus lies, but in industry? I've seen many models perform poorly because the code is so hard to read and refactor that big issues remained unspotted for months together. And this is especially befuddling because for a field that is completely fine with spending months to get an ROI of single digit increases in model performance metrics during the experimentation phase, they don't seem to care about anything that might go wrong in production. That just feels like a fundamental disconnect, since without the core ML stuff working perfectly, none of the other stuff (like what I do) has any value -- and even so, I'm taught to hold my code to a much higher standard than the critical stuff -- which I'm happy about since I can now write production code by default -- but it's just... weird. Like the vending machines at a nuclear power plant being better engineered than the reactor. Is this a common problem or is this a localized issue that I'm facing? submitted by /u/vPyDev [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    I built an AI Discord Bot that bans NFT Bros [Meme][Video]
    (0 min) submitted by /u/TernaryJimbo [link] [comments]
    [project] Credit Scoring
    (0 min) hey peeps, any research thesis regarding credit scoring in microfinance? submitted by /u/abschlusssss [link] [comments]
  • Reinforcement Learning

    "Player of Games", Schmid et al 2021 {DM} (generalizing AlphaZero to imperfect-information games)
    (1 min) submitted by /u/gwern [link] [comments]
    What is the best method/technology to deploy deep reinforcement learning models for robotics?
    (1 min) Hello, I am a robotics enthusiast building a robot arm to perform mundane tasks. So far, I have only been able to deploy deep reinforcement learning models on the Jetson Nano. The process has been easy using the NVIDIA TensorRT SDK. What other methods/technologies exist? My friend recommended using a Google Coral and a Raspberry Pi. Any recommendations would be greatly appreciated. Thank you. submitted by /u/AINerd1 [link] [comments]
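    One common route, sketched below under the assumption of a PyTorch policy network: export to ONNX, then run the graph with TensorRT (Jetson), ONNX Runtime, or, via TFLite, an Edge TPU such as the Coral. The tiny network here is a hypothetical stand-in:

        import torch

        policy = torch.nn.Sequential(  # hypothetical stand-in for a trained policy
            torch.nn.Linear(24, 64), torch.nn.ReLU(), torch.nn.Linear(64, 7),
        )
        dummy_obs = torch.randn(1, 24)  # observation shape is illustrative

        torch.onnx.export(
            policy, dummy_obs, "policy.onnx",
            input_names=["obs"], output_names=["action_logits"],
        )
        # On a Jetson, `trtexec --onnx=policy.onnx` builds a TensorRT engine;
        # on a Raspberry Pi + Coral the usual path is TFLite + the Edge TPU compiler.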
    Markov Decision Process explained through Memes
    (1 min) Today I published my first article on medium explaining Markov Decision Process with memes. Hope someone finds this helpful. https://medium.com/@saminbinkarim/explaining-markov-decision-process-with-memes-af679a0af343 submitted by /u/sardines_again [link] [comments]
    Proper way to count environment steps / frames in distributed RL architecture for algorithms like CLEAR or LASER => basically modified impala with replay
    (2 min) In the classical (on-policy) V-trace/IMPALA algorithm, env_steps is incremented every training iteration like this: env_steps += size_of_batch * unroll_length. However, when we add a replay buffer to V-trace (like in CLEAR or LASER), we also start to learn from off-policy data, so some trajectories (from replay) can be sampled more than once. My question is: does the env-step counting stay the same even when some trajectories in the batch come from replay and some come directly from workers (on-policy)? Or do I count steps only for the on-policy trajectories, i.e., those not from replay, like: env_steps += len(batch["on_policy_samples"]) * unroll_length? And if so, what happens when I decide to train only on samples from replay, how do I count env_steps then? Using some kind of flag to indicate whether a specific trajectory was already sampled, and excluding it from the count? I have been digging through the research papers for some time now and I haven't been able to find the proper way to do it. Any help or advice would be much appreciated. submitted by /u/ParradoxSVK [link] [comments]
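    The two counting schemes contrasted in the post, written out as a sketch (the "from_replay" field is hypothetical bookkeeping, not from any paper):

        def count_all(batch, unroll_length):
            # classic on-policy V-trace/IMPALA: every trajectory is fresh
            return len(batch) * unroll_length

        def count_on_policy_only(batch, unroll_length):
            # with a replay buffer, count only trajectories that came straight
            # from the actors; replayed samples add gradient steps, not new frames
            fresh = sum(1 for traj in batch if not traj["from_replay"])
            return fresh * unroll_length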
    What should a beginner do after learning some basic ideas?
    (2 min) Hi all, it's my first time using Reddit to ask for suggestions. I am a beginner in RL. I learned some basic ideas in RL (such as SARSA, PPO and A2C) recently and applied a Q-network to train a Go game agent. Personally speaking, I am going to pursue a position as an RL engineer in the future, but I have no idea what I should do to improve my RL skills. So I am wondering if there are some awesome resources for a beginner like me? Or could you please give me some pointers? submitted by /u/ZavierTi2021 [link] [comments]

2022-01-01

  • Featured Blog Posts - Data Science Central

    Building a hypergraph-based semantic knowledge sharing environment for construction
    (5 min) High-rise construction in the Denny Triangle area of Seattle. Source: Bruce Englehardt, Wikimedia Commons, May 2020. I've been looking for fresh use cases that tell the knowledge graph adoption story from new angles. Which industries, for instance, have been having…
  • John D. Cook

    Architecture and Math
    (1 min) I recently received a review copy of Andrew Witt’s new book Formulations: Architecture, Mathematics, and Culture. The Hankel function on the cover is the first clue that this book contains some advanced mathematics. Or rather, it references some advanced mathematics. I’ve only skimmed the book so far, but I didn’t see any equations. Hankel functions […] Architecture and Math first appeared on John D. Cook.
    Turning the Golay problem sideways
    (3 min) I’ve written a couple posts about the Golay problem recently, first here then here. The problem is to find all values of N and n such that C(N,0) + C(N,1) + … + C(N,n) is a power of 2, say 2^p. Most solutions apparently fall into three categories: n = 0 or n = N, N is odd and n = (N-1)/2, […] Turning the Golay problem sideways first appeared on John D. Cook.
  • Artificial Intelligence

    Hierarchical Federated Learning-Based Anomaly Detection Using Digital Twins For Internet of Medical Things (IoMT)
    (1 min) Smart healthcare services can be provided by using Internet of Things (IoT) technologies that monitor patients' health conditions and vital body parameters. Most IoT solutions used to enable such services are wearable devices, such as smartwatches, ECG monitors, and blood pressure monitors. The huge amount of data collected from smart medical devices leads to major security and privacy issues in the IoT domain. Considering Remote Patient Monitoring (RPM) applications, the focus is on Anomaly Detection (AD) models, whose purpose is to identify events that differ from typical user behavior patterns. When designing centralized AD models, researchers generally face security and privacy challenges (e.g., patient data privacy, training data poisoning). To overcome these issues, the authors of this paper propose an AD model based on Federated Learning (FL), which allows different devices to collaborate and train locally in order to build AD models without sharing patients’ data. Specifically, they propose a hierarchical FL scheme that enables collaboration among different organizations by building separate AD models for patients with similar health conditions. Continue Reading the Paper Summary: https://www.marktechpost.com/2022/01/01/hierarchical-federated-learning-based-anomaly-detection-using-digital-twins-for-internet-of-medical-things-iomt/ Full Paper: https://arxiv.org/pdf/2111.12241.pdf submitted by /u/ai-lover [link] [comments]
    Generate artistic images with OpenAI’s Glide 🖼
    (0 min) submitted by /u/tridoc [link] [comments]
    My Top 10 Computer Vision papers of 2021
    (0 min) submitted by /u/OnlyProggingForFun [link] [comments]
    Should we be concerned?
    (3 min) Should we be a little worried by how fast AI is developing? submitted by /u/Particular_Leader_16 [link] [comments]
    Is there a community/website/company which is collecting and categorizing AIs?
    (0 min) submitted by /u/xXNOdrugsForMEXx [link] [comments]
  • Machine Learning

    [D] Plug or Integrate a GNN Pytorch code base into Spark Cluster
    (1 min) Does anyone have a good explanation or resources to share on plugging or integrating PyTorch-based GNN models into PySpark or similar cluster services? submitted by /u/SpiritMaleficent21 [link] [comments]
    [P] DeepCreamPy - Decensoring Hentai with Deep Neural Networks
    (0 min) submitted by /u/binaryfor [link] [comments]
    "[R]" Neuron outputs as weights
    (1 min) https://stats.stackexchange.com/questions/558864/what-if-weights-of-model-is-output-of-neurons submitted by /u/Dry_Introduction_897 [link] [comments]
    [D] Best Practices in Machine Learning
    (1 min) This is a non-profit that promotes best practices in machine learning, specifically for responsible ML. The practices are open source too, which is cool. Link here: https://www.fbpml.org/the-best-practices I think their technical best practices seem a little stronger than the organisational ones. Thoughts? ** this is their LinkedIn URL: https://www.linkedin.com/company/the-foundation-for-best-practices-in-machine-learning submitted by /u/Sbu91 [link] [comments]
    [Research] My Top 10 Computer Vision papers of 2021
    (0 min) submitted by /u/OnlyProggingForFun [link] [comments]
    [R] MT3: Multi-Task Multitrack Music Transcription
    (2 min) submitted by /u/Illustrious_Row_9971 [link] [comments]
    BERT Goes Shopping: Comparing Distributional Models for Product Representations (Paper Walkthrough) [D]
    (0 min) submitted by /u/prakhar21 [link] [comments]
  • Neural Networks, Deep Learning and Machine Learning

    Neural network drawing mushrooms
    (0 min) submitted by /u/noodlefist [link] [comments]
  • Reinforcement Learning

    NetHack 2021 NeurIPS Challenge -- winning agent episode visualizations
    (5 min) Hi All! I am Michał from the AutoAscend team that has won the NetHack 2021 NeurIPS Challenge. I have just shared some episode visualization videos: https://www.youtube.com/playlist?list=PLJ92BrynhLbdQVcz6-bUAeTeUo5i901RQ The winning agent isn't based on reinforcement learning in the end, but the victory of symbolic methods in this competition shows what RL is still missing to some extent -- so I believe this subreddit is a good place to discuss it. We hope that NLE will someday become a new standard benchmark for evaluation next to chess, go, Atari, etc. as it presents a set of whole new complex problems for agents to learn. Contrary to Atari, NetHack levels are procedurally generated, and therefore agents can't memorize the layout. Observations are highly partial, rewards are sparse, and episodes are usually very long. Here are some other useful links related to the competition: Full NeurIPS Session recording: https://www.youtube.com/watch?v=fVkXE330Bh0 AutoAscend team presentation starts here: https://youtu.be/fVkXE330Bh0?t=4437 Competition report: https://nethackchallenge.com/report.html AICrowd Challenge link: https://www.aicrowd.com/challenges/neurips-2021-the-nethack-challenge submitted by /u/procedural_only [link] [comments]
    Some questions on DRL
    (1 min) Hello, I'm applying Deep Reinforcement Learning for the first time, and I have some questions about it (I've already looked for answers, but in vain): (1) How do I normalize objectives' values in the reward function, if one objective's values are in the range of 10 and another objective's values are in the range of 1000? (2) During the training phase, how can we watch a network's weight updates and gradient calculations? (3) In a multi-agent setting with an episodic task, should the "dones" vector be set to True once all the agents are finished, or should done[agent_index]=True be set as each agent finishes its task? In other words, we won't wait for the last agent to finish to set dones = [True]*number_of_agents. Thank you. submitted by /u/GuavaAgreeable208 [link] [comments]
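    Sketches for the first two questions (illustrative, not authoritative): normalize each objective by a running estimate of its statistics before combining, and read gradients off named_parameters() after backward():

        # (1) Running normalization so a ~10-range objective and a ~1000-range
        #     objective contribute comparably to the reward (Welford's algorithm).
        class RunningScale:
            def __init__(self, eps=1e-8):
                self.mean, self.var, self.n, self.eps = 0.0, 1.0, 0, eps

            def update(self, x):
                self.n += 1
                delta = x - self.mean
                self.mean += delta / self.n
                self.var += (delta * (x - self.mean) - self.var) / self.n

            def normalize(self, x):
                return (x - self.mean) / (self.var ** 0.5 + self.eps)

        # (2) Watching weight updates / gradients in PyTorch after loss.backward():
        # for name, p in network.named_parameters():
        #     print(name, p.grad.norm().item())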
    Explained Variance suddenly crashes, and stay around zero and even sometimes go to negative values
    (2 min) Hi all, happy New Year! I'm working on a project to teach my quad-rotor how to hover, and this is what my rollout graphs look like. The reward graph increases steadily for some time, but suddenly, during a rapid increase in reward, the entropy loss and the training std start to diverge, and the explained variance just crashes. https://preview.redd.it/yv4qbhmbc2981.png?width=826&format=png&auto=webp&s=31322a62963b1de47ed72452dfb36498ce222dd1 https://preview.redd.it/d38papwdc2981.png?width=1373&format=png&auto=webp&s=0ab5b98d0bdad165f0dcfd7f5ea3e54ad965842c From my understanding, an explained variance far from 1 is a bad sign. Could you please point out, or give comments on, what the possible reason might be for why this is happening in my case? Any kind of reply would be appreciated. Thanks again, and happy New Year. :D Hp: ent_coef=0.0001, learning_rate=linear decay from 0.00004, n_steps=1024, batch_size=64; apart from these, I used the default params from stable_baselines3. I scale the action output from [-1, 1] to [min_rpm, max_rpm] to actuate the quad-rotor. Observation states = input = [pos, vel, acc, ang, ang_vel] // network size 64,64 // algorithm used: PPO. REWARD FUNCTION https://preview.redd.it/o3w2i0r7s9981.png?width=818&format=png&auto=webp&s=f03b4f12d6c57f7b7aaec7dbd03b9fecad2c8f2c submitted by /u/GOMTAE [link] [comments]
    Help With PPO Model Performing Poorly
    (2 min) I am attempting to recreate the PPO algorithm to learn the inner workings of the algorithm better and to learn more about actor-critic reinforcement learning. So far, I have a model that seems to learn, just not very well. In the early stages of training, the algorithm is more sporadic and may happen to find a pretty solid policy, but due to how unstable the early parts of training are, it tends to move away from this policy. Eventually, the algorithm moves the policy toward a reward of around 30. For the past few commits in my repo where I have attempted to fix this issue, the policy always tends toward the ~30 reward mark, and I'm not entirely sure why. I'm thinking maybe I implemented the algorithm incorrectly, but I'm not certain. Can someone please help me with this issue? Below are links to images of training using the latest commit and a previous commit, and my GitHub project. current commit: https://ibb.co/JQgnq1f previous commit: https://ibb.co/rppVHKb GitHub: https://github.com/gmongaras/PPO_CartPole Thanks for your help! submitted by /u/gmongaras [link] [comments]
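    For comparison while debugging, here is the clipped surrogate at the heart of PPO as a standalone PyTorch sketch (names and shapes are illustrative). Two classic causes of plateaus worth checking: advantages not normalized per batch, and old log-probs that were not detached from the graph:

        import torch

        def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
            # old_log_probs must come from the rollout, with no gradient attached
            ratio = torch.exp(new_log_probs - old_log_probs)
            unclipped = ratio * advantages
            clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
            return -torch.min(unclipped, clipped).mean()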

2021-12-31

  • Featured Blog Posts - Data Science Central

    Lying to blockchains and other Web3 dilemmas
    (6 min) Semantic data consultant @EmekaOkoye made a couple of great points on Twitter back in November and early December 2021: “I am amused how some of us are pretending that #SmartAgent [tech] is not going to be a thing.” …
    Digital Workers in a Business Context
    (4 min) What is a digital worker? Digital workers are your software-based co-workers. Think Siri, Alexa, or Cortana, not exactly like them but to some extent. Yes, we are calling them your co-workers or you can refer to them as your virtual assistants if you like. They will help you in doing…
  • The Official NVIDIA Blog

    5 Ways AI Aimed to Improve the World in 2021
    (3 min) Not so long ago, searching for information could lead to a library to scan endless volumes or even tediously sift through microfilm. Clearly, technology is making the world a better place. Scientists, researchers, developers and companies have been on a quest to solve some of the world’s most pressing problems. Only now they’re accelerating their Read article > The post 5 Ways AI Aimed to Improve the World in 2021 appeared first on The Official NVIDIA Blog.
  • Blog – Machine Learning Mastery

    Anomaly Detection with Isolation Forest and Kernel Density Estimation
    (7 min) Anomaly detection aims to find data points that deviate from the norm. In other words, those are the points that […] The post Anomaly Detection with Isolation Forest and Kernel Density Estimation appeared first on Machine Learning Mastery.
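    A minimal sketch of the two approaches the post pairs, using scikit-learn (the toy data is illustrative):

        import numpy as np
        from sklearn.ensemble import IsolationForest
        from sklearn.neighbors import KernelDensity

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 2))                 # inliers
        X_test = np.array([[0.1, 0.2], [4.0, 4.0]])   # the second point is anomalous

        iso = IsolationForest(random_state=0).fit(X)
        print(iso.predict(X_test))                    # [ 1 -1]; -1 flags the anomaly

        kde = KernelDensity(bandwidth=0.5).fit(X)
        print(kde.score_samples(X_test))              # log-density; lower => more anomalous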
  • Seita's Place

    Books Read in 2021
    (28 min) At the end of every year I have a tradition where I write summaries of the books that I read throughout the year. Here’s the following post with the rough set of categories: Popular Science (6 books) History, Government, Politics, Economics (6 books) Biographies / Memoirs (5 books) China (5 books) COVID-19 (2 books) Miscellaneous (7 books) I read 31 books this year. You can find the other blog posts from prior years (going back to 2016) in the blog archives. Popular Science This also includes popular science, which means the authors might not be technically trained as scientists. Who We Are and How We Got Here: Ancient DNA and the New Science of the Human Past (2018) is by famous geneticist and Harvard professor David Reich. Scientific advances in analyzing DNA have allowed better analysis…
  • John D. Cook

    Update on the Golay problem
    (2 min) I played around with what I’m calling the Golay problem over Christmas and wrote a blog post about it. I rewrote the post as I learned more about the problem due to experimentation and helpful feedback via comments and Twitter. In short, the Golay problem is to classify the values of N and n such […] Update on the Golay problem first appeared on John D. Cook.
  • Artificial Intelligence

    AI - A love story // AI-generated video about the future of AI // prompt -> GPT-J-6B -> Aphantasia
    (1 min) submitted by /u/6owline1vex [link] [comments]
    AI News in 2021: a Detailed Digest
    (0 min) https://lastweekin.ai/p/ai-news-in-2021-a-digest submitted by /u/regalalgorithm [link] [comments]
    I made a virtual Twitch Streamer, who responds to your chats using OpenAI and Text-to-Speech
    (1 min) submitted by /u/C0de_monkey [link] [comments]
    Happy New Year to everyone! Let's hope for a wonderful 2022. The text is created with polygons that evolve with an Evolutionary Algorithm.
    (1 min) submitted by /u/ValianTek_World [link] [comments]
    [R] Microsoft’s Self-Supervised Bug Detection and Repair Approach Betters Baselines By Up to 30%
    (1 min) In the NeurIPS 2021-accepted paper Self-Supervised Bug Detection and Repair, a Microsoft Research team proposes BUGLAB, a self-supervised approach that significantly improves on baseline methods for detecting bugs in real-life code. Here is a quick read: Microsoft’s Self-Supervised Bug Detection and Repair Approach Betters Baselines By Up to 30%. The code and PyPIBugs dataset are available on the project’s GitHub. The paper Self-Supervised Bug Detection and Repair is on arXiv. submitted by /u/Yuqing7 [link] [comments]
    Commercially available AIs or chatbots that you can explicitly teach/train?
    (4 min) I've been playing around with Replika, and it's incredibly fun and impressive as a general "companion" chatbot, but I find it really disappointing in its lack of ability to learn explicit facts or procedures. It's targeted toward making people feel like they have a friend and roleplaying, not learning anything external to those goals. So, is there anything commercially available that you can both: Talk to (mostly) like a real person, and Teach from the ground up (like a child) ? As an example, my Replika "wanted" to do a lesson with me, so I tried to teach it colors. Ostensibly it has some picture recognition abilities, but despite that, it was never able to learn which color was which, even when using the same image files to display the same color. Okay, fine, forget colors, but it can't learn explicit facts either. I want to be able to input things like "Lacan defines the subject as that which is represented by a signifier for another signifier" or "It's important to apply a primer before your first layer of eyeshadow" or "Leonardo DiCaprio played Jack Dawson in the movie Titanic" and have it be able to actually remember and recall that information in future conversations. (Replika is able to recall it within the same conversation, but it resets after a certain amount of time. It also seems to struggle with remembering things that differ from user to user, like favorite food, since it learns from the aggregate of its conversations.) Is there anything out there that can do this? Anything on the horizon? submitted by /u/peppermint-kiss [link] [comments]
  • Machine Learning

    [P] Play around with StyleGAN2 in your browser
    (1 min) I built a little page to run and manipulate StyleGAN2 in the browser. https://ziyadedher.com/faces It was pretty fun learning about ONNX and how to port GANs to web. You can play around with the random seeds and also distort the intermediate latents to produce some really wacky results. You can check out a GIF on Twitter. Let me know if you come up with anything cool! submitted by /u/Cold999 [link] [comments]
    [P] Top arXiv Machine Learning papers in 2021 according to metacurate.io
    (2 min) With 2021 almost in the books (there are still a couple of hours to go at the time of this writing), here are the top machine learning papers per month from the arXiv pre-print archive as picked up by metacurate.io in 2021. January Can a Fruit Fly Learn Word Embeddings? Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity Muppet: Massive Multi-task Representations with Pre-Finetuning February How to represent part-whole hierarchies in a neural network Patterns, predictions, and actions: A story about machine learning Fast Graph Learning with Unique Optimal Solutions March Fast and flexible: Human program induction in abstract reasoning tasks Learning to Resize Images for Computer Vision Tasks The Prevalence of Code Smells in Mac…
  • Neural Networks, Deep Learning and Machine Learning

    Siamese Neural Networks for Semantic Text Similarity
    (0 min) A repository containing comprehensive neural-network-based PyTorch implementations for the semantic text similarity task, including architectures such as Siamese-LSTM, Siamese-LSTM-Attention, Siamese-Transformer, and Siamese-BERT. https://github.com/shahrukhx01/siamese-nn-semantic-text-similarity submitted by /u/shahrukhx01 [link] [comments]
  • Reinforcement Learning

    Agent not learning! Any Help
    (2 min) Hello, can someone explain why the actor-critic maps states to the same actions; in other words, why does the actor output the same action whatever the state? This is what makes the agent learn nothing during the training phase. Happy New Year! submitted by /u/LeatherCredit7148 [link] [comments]
    question about baseline with value network
    (1 min) I have a question about baselines in policy gradient methods. In many PG implementations, the policy and value networks share parameters. But I understood that the baseline should not affect the parameters of the policy. When PG with shared networks and a baseline updates the value function, the baseline will then affect the policy network. Is that okay? submitted by /u/Spiritual_Fig3632 [link] [comments]
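    A sketch of the usual resolution (dummy tensors stand in for rollout quantities): the baseline enters the policy loss only through a detached advantage, so the policy-gradient term sends no gradient into the value head; the shared trunk is still trained on the value function, but only through the separate value loss.

        import torch

        returns = torch.randn(8)                         # illustrative rollout returns
        values = torch.randn(8, requires_grad=True)      # value-head outputs
        log_probs = torch.randn(8, requires_grad=True)   # log pi(a|s) from the policy head

        advantage = (returns - values).detach()          # detach: baseline can't bias the policy grad
        policy_loss = -(log_probs * advantage).mean()
        value_loss = (returns - values).pow(2).mean()    # this term is what trains the baseline
        (policy_loss + 0.5 * value_loss).backward()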
    How to check feature importance in Reinforcement learning?
    (1 min) I have made a custom environment in OpenAI Gym using data from a CSV file that contains certain features as column names. After training a DQN agent on the environment, I wanted to check which features the DQN agent meaningfully uses to make its policy, but I am unable to find good resources on this. (The post includes an example feature importance graph.) submitted by /u/EBISU1234 [link] [comments]
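    One simple, model-agnostic option is permutation importance: shuffle one feature column at a time and measure how often the agent's greedy action changes. A sketch (it assumes q_net maps a batch of states to a NumPy array of Q-values):

        import numpy as np

        def permutation_importance(q_net, states, seed=0):
            rng = np.random.default_rng(seed)
            base_actions = q_net(states).argmax(axis=1)   # greedy policy on real data
            importance = []
            for j in range(states.shape[1]):
                shuffled = states.copy()
                rng.shuffle(shuffled[:, j])               # break feature j only
                new_actions = q_net(shuffled).argmax(axis=1)
                importance.append((base_actions != new_actions).mean())
            return importance  # fraction of states whose action flips, per feature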
    Current unanswered/interesting applications in Multi-armed bandits?
    (1 min) Hi, I am planning on doing my MSc in CS with a focus on RL. More specifically, I want to learn about multi-armed bandits and how they can be used by agents to perform actions in diverse environments. I am new to this field and want to know which questions about MABs are unanswered. Are there any interesting applications currently under research? I would really appreciate it if anyone could help me out. Thank you! submitted by /u/_UNIPOOL [link] [comments]

2021-12-30

  • Featured Blog Posts - Data Science Central

    Top 12 Software Development Trends in 2022 You Must Watch Out
    (5 min) There are constant changes in the software development trends, but a few trends seem to be dominant in 2022. With the evolution of advanced technology, there has been a significant change in the software development landscape. Businesses need to keep up with these…
  • The Official NVIDIA Blog

    Innovation Inspiration: 5 Startup Stories From NVIDIA Inception in 2021
    (2 min) NVIDIA Inception is one of the largest startup ecosystems in the world — and its thousands of members achieved impressive feats in 2021, bringing AI and data science to an array of industries. NVIDIA Inception nurtures cutting-edge AI, data science and HPC startups with go-to-market support, expertise and technology. This year, the program surpassed 9,000 Read article > The post Innovation Inspiration: 5 Startup Stories From NVIDIA Inception in 2021 appeared first on The Official NVIDIA Blog.
    GFN Thursday Says ‘GGs’ to 2021 With Our Community’s Top Titles of the Year
    (4 min) It’s the final countdown for what’s been a big year for cloud gaming. For the last GFN Thursday of the year, we’re taking a look at some of the GeForce NOW community’s top picks of games that joined the GeForce NOW library in 2021. Plus, check out the last batch of games coming to the Read article > The post GFN Thursday Says ‘GGs’ to 2021 With Our Community’s Top Titles of the Year appeared first on The Official NVIDIA Blog.
  • AI Weirdness

    New Years Resolutions generated by AI
    (6 min) This month I'm beginning 2022 as the first Futurist in Residence at the Smithsonian Arts and Industries Building. It's weird to think of myself as a futurist. I write a lot about the algorithms we're calling artificial intelligence (AI), but rather than deal with
    Bonus: more new year's resolutions to consider
    (1 min) AI Weirdness: the strange side of machine learning
  • John D. Cook

    Names and numbers for modal logic axioms
    (2 min) Stanislaw Ulam once said Using a term like nonlinear science is like referring to the bulk of zoology as the study of non-elephant animals. There is only one way to be linear, but there are many ways to not be linear. A similar observation applies to non-classical logic. There are many ways to not be […] Names and numbers for modal logic axioms first appeared on John D. Cook.
    When do two-body systems have stable Lagrange points?
    (3 min) The previous post looked at two of the five Lagrange points of the Sun-Earth system. These points, L1 and L2, are located on either side of Earth along a line between the Earth and the Sun. The third Lagrange point, L3, is located along that same line, but on the opposite side of the Sun. […] When do two-body systems have stable Lagrange points? first appeared on John D. Cook.
  • Artificial Intelligence

    AI shopping App
    (0 min) Are there any AI apps that match your body type (scan your body, etc.) to the clothing stores that would have the best sizes to fit your body? I'm tired of ordering clothes online that don't fit well when I get them. Thx. submitted by /u/GroundbreakingRain78 [link] [comments]
    Watch this model describe code
    (2 min) submitted by /u/landongarrison [link] [comments]
    GPT-3, Foundation Models, and AI Nationalism
    (1 min) submitted by /u/regalalgorithm [link] [comments]
    What AIs are there which are able to edit art?
    (0 min) submitted by /u/xXLisa28Xx [link] [comments]
    What if AI ran a city? What if it ran society?
    (1 min) Ever since Grimes posted her Tiktok video last year inviting communists to join up under a supposedly benevolent AI, I've been kind of obsessed with what would really happen if AI ran a city, or even a whole society. Would it really be as awesome as she seems to think or would it be a potentially horrifying nightmare? (My money's on nightmare, btw) Anyway, I tried to answer some of those questions in a work of fiction, available below as a free download. It's a quick read, and I'd love to hear your thoughts on these problems, whether or not you read the book. No doubt someone soon will be trying to implement these kinds of "solutions" in our world... and it's best to be at least marginally prepared! https://lostbooks.gumroad.com/l/conspiratopia/r-artificial Happy New Year! submitted by /u/canadian-weed [link] [comments]
    Mark Twain AI Simulation
    (1 min) submitted by /u/montpelliersudfrance [link] [comments]
    OpenAI Introduces ‘GLIDE’ Model For Photorealistic Image Generation
    (1 min) Images, such as graphics, paintings, and photographs, can frequently be described in language, but making them can take specific talents and hours of effort. As a result, a technology capable of creating realistic graphics from natural language can enable humans to produce rich and diverse visual material with previously unimaginable ease. The capacity to modify photos with natural language enables iterative refinement and fine-grained control, both essential for real-world applications. DALL-E, a 12-billion-parameter version of OpenAI’s GPT-3 transformer language model meant to produce photorealistic pictures using text captions as cues, was unveiled in January. DALL-E’s fantastic performance was an instant hit in the AI community and drew broad mainstream media coverage. Last month, NVIDIA unveiled the GAN-based GauGAN2, a name inspired by French Post-Impressionist painter Paul Gauguin, much as DALL-E was inspired by Surrealist artist Salvador Dali. Not to be outshone, OpenAI researchers unveiled GLIDE (Guided Language-to-Image Diffusion for Generation and Editing). This diffusion model achieves performance comparable to DALL-E despite utilizing only one-third of the parameters. You can continue reading this short summary here The code and weights for these models may be found on the project’s GitHub page. The research paper for GLIDE can be found here. https://preview.redd.it/boifxsixem881.png?width=705&format=png&auto=webp&s=14de64b727e31dba7b55ecf76bdfdb52463f2e01 submitted by /u/ai-lover [link] [comments]
  • Reinforcement Learning

    PPO with multiple heads
    (2 min) Hello there, I am working on an implementation of PPO, and I am struggling to compute the loss when two output heads are used for the actor network (one continuous and one discrete). The loss in PPO has three components: (1) the clipped surrogate, (2) the squared state-value loss, and (3) entropy. I thought of treating the two actions separately and computing two different losses that I add before backpropagating, but the middle term (2) is the same in both losses. How should I handle that? Do I add (1) and (3) for both actions and have one loss, or do I compute two different losses and then add them? Or is it hidden option 3? :p Any help is appreciated. submitted by /u/AhmedNizam_ [link] [comments]
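    One common resolution, as a sketch (not the only option): treat the action as a pair, so its joint log-probability is the sum of the discrete and continuous log-probs. That yields a single clipped surrogate, one value loss counted once, and the two entropies added:

        import torch

        def joint_log_prob(cont_dist, disc_dist, cont_action, disc_action):
            # assumes independent heads, e.g. a Normal and a Categorical
            return cont_dist.log_prob(cont_action).sum(-1) + disc_dist.log_prob(disc_action)

        def ppo_loss(new_lp, old_lp, adv, values, returns, entropy,
                     clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
            ratio = torch.exp(new_lp - old_lp)
            surrogate = -torch.min(
                ratio * adv,
                torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv,
            ).mean()
            value_loss = (returns - values).pow(2).mean()  # appears once, not per head
            return surrogate + vf_coef * value_loss - ent_coef * entropy.mean()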
    Discrete vs Continuous action space
    (1 min) Hello there, I noticed that in many works by OpenAI and DeepMind in RL, they deal with a continuous action space which they discretize. For example, rather than having two continuous actions to determine speed and direction of movement, they would discretize speed and torque. Is there any reason why a discretized action space is more favorable? submitted by /u/AhmedNizam_ [link] [comments]
    Combined discrete and continuous action space
    (1 min) I need an algorithm that combines both discrete and continuous action spaces, e.g., if I wanted to choose an action but also needed to perform that action at an optimized velocity. Can I do this with PPO, and how? I am still a beginner, but most of the algorithms I played with had either one or the other separately. I thought of actor-critic, but I don't think the value network does what I want; I can't use it to predict how much velocity to use. submitted by /u/ConsiderationCivil74 [link] [comments]
    Actor net gives me the same reward over time
    (1 min) Hello, I'd be glad if someone could help me with this. I am new to DRL, and when I applied it I ran into some issues. For example, my network gets stuck at the same reward from the first episode. I also noticed that the agent always selects the same index from the action space. Even with the same reward and the same trajectory toward this reward, the critic loss is still shrinking over time (this is the weirdest thing). I don't know what the reason behind that is. https://preview.redd.it/02zlhgnqnp881.png?width=477&format=png&auto=webp&s=093c2b22d368e33fa4853ca4f3c5b459c9e2e31d submitted by /u/LeatherCredit7148 [link] [comments]
    Ablation test/study
    (1 min) I'm trying to see the effect of simulated sensor noise in training a successful policy using RL for the cart pole control problem. I have four states: cart position, cart velocity, pole angle, pole angular velocity, and I added Gaussian noise to all of them. However, I want to find the impact of adding noise to a specific variable to find which of the states is the most sensitive to noise. If I were to only test the system with noise in one of the variables at a time (e.g. noise for position, but none for the other variables), would that be considered an ablation test? Just caught up in the terminology, because I know typical ablation tests remove a certain component instead of adding only one (or removing all but one). submitted by /u/Cogitarius [link] [comments]
    Best suite for robotics tasks
    (1 min) Recently, OpenAI removed the robotics tasks from gym, as they are no longer being maintained and Robosuite uses mujoco-py which has been deprecated. Deepmind's control suite includes some robotics tasks but they don't seem to be first-class citizens (they are in a separate module, with less options and some bugs). Are there any other robotics environment suites that are well-maintained and supported? submitted by /u/escapevelocitylabs [link] [comments]
    How to map states into the code?
    (2 min) Hello, I am new to the Reinforcement Learning field. My question is: how can I easily map my states into my code? For instance, when the problem is about moving an agent in a board-like environment, such as walking from point A to B by moving UP, DOWN, LEFT or RIGHT, it is quite easy to map your states: you can create a matrix like NUMBER_OF_ROWS x NUMBER_OF_COLUMNS and that is pretty much it. However, when it comes to other types of scenarios, such as CartPole-v0 from OpenAI Gym (https://gym.openai.com/envs/CartPole-v0/), I don't understand how to map the states into my code. This scenario has 4 state variables and 2 actions. States: cart position (range -2.4 to 2.4), cart velocity (range -Inf to Inf), pole angle (range -12deg to 12deg), pole angular velocity (range -Inf to Inf). Actions: push cart to the left, push cart to the right. The same is true for every other scenario that is not board-like, e.g., tic-tac-toe: I don't understand how I can make every possible state in the game a state inside the code. Is there any technique for approaching each type of problem, or do I just need to figure it out by myself? Have a nice day, Gabriel. submitted by /u/gabrieloxi [link] [comments]
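    One standard trick for tabular methods on continuous states is binning: clip each unbounded variable to a finite range, discretize every dimension, and index the Q-table by the resulting tuple. A sketch (the bin count and velocity clip ranges are arbitrary choices):

        import numpy as np

        n_bins = 10
        lows = np.array([-2.4, -3.0, -0.21, -3.0])   # velocities clipped to a finite range
        highs = np.array([2.4, 3.0, 0.21, 3.0])

        def discretize(obs):
            ratios = (np.clip(obs, lows, highs) - lows) / (highs - lows)
            return tuple((ratios * (n_bins - 1)).astype(int))

        q_table = np.zeros((n_bins,) * 4 + (2,))     # 4 state dims, 2 actions
        state = discretize(np.array([0.1, 0.5, -0.05, 1.2]))
        print(q_table[state])                        # the 2 Q-values for this binned state

    Beyond a handful of dimensions this blows up, which is exactly where function approximation (e.g., a DQN) replaces the table.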
    General question in regards to understanding the proofs of Deterministic Policy Gradient
    (1 min) Hi there, I was a graduate student working on a few RL-related problems a year ago. Recently I've become very interested in policy optimization problems, and started by reading some foundational papers like DQN and DPG. While reading the DPG paper, it was very straightforward to understand the main idea, but it was very difficult to understand the proofs of the theorems in the appendix (http://proceedings.mlr.press/v32/silver14-supp.pdf). I barely understood the proof of Theorem 1, and the proof of Theorem 2 is almost impossible for me to follow due to gaps in my mathematics background. My question is: what should I study or learn before tackling this kind of proof? What subject or material should I look at in order to follow the flow of the proofs and the definitions/notation used in the DPG proofs? I personally think it is crucial to understand the proofs of the main theorems because they capture the authors' intuition and motivation, which is the foundation of the paper. submitted by /u/g6ssgs [link] [comments]

2021-12-29

  • Featured Blog Posts - Data Science Central

    AWS Cloud Security: Best Practices
    (4 min) Companies today need to be nimble and ready in the face of constantly evolving technology and changing consumer preferences. To achieve this, organizations are switching to Amazon Web Services (AWS). It enables companies to rapidly deploy and scale technology that meets the growing (or shrinking) demand without having to invest in expensive IT…
  • The Official NVIDIA Blog

    AI Podcast Wrapped: Top Five Episodes of 2021
    (2 min) Recognized as one of tech’s top podcasts, the NVIDIA AI Podcast is approaching 3 million listens in five years, as it sweeps across topics like robots, data science, computer graphics and renewable energy. Its 150+ episodes reinforce the extraordinary capabilities of AI — from diagnosing disease to boosting creativity to helping save the Earth — Read article > The post AI Podcast Wrapped: Top Five Episodes of 2021 appeared first on The Official NVIDIA Blog.
  • Blog – Machine Learning Mastery

    One-Dimensional Tensors in Pytorch
    (9 min) PyTorch is an open-source deep learning framework based on Python language. It allows you to build, train, and deploy deep […] The post One-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.
    Running and Passing Information to a Python Script
    (10 min) Running your Python scripts is an important step in the development process, because it is in this manner that you’ll […] The post Running and Passing Information to a Python Script appeared first on Machine Learning Mastery.
  • Becoming Human: Artificial Intelligence Magazine - Medium

    Use cases of Image Segmentation Using Deep Learning
    (4 min) In the last decade, computer vision technology has advanced substantially, thanks to advances in AI and deep learning methodologies. It is…
    Healthcare Technology Trends and Digital Innovations in 2022
    (11 min) With 2020 well behind us, COVID-19’s presence continues to linger around the world. Of all the industries that have been forever changed by…
  • Neural Networks, Deep Learning and Machine Learning

    Baidu And PCL Team Introduce ERNIE 3.0 Titan: A Pre-Training Language Model With 260 Billion Parameters
    (1 min) With recent breakthroughs in AI, humans have become more reliant on AI to address real-world problems, which makes the ability to learn and act on knowledge just as essential for a computer as it is for humans. Humans gather information through learning and experience to understand their immediate surroundings. The ability to comprehend and solve problems, and to separate facts from absurdities, increases as the knowledge base grows. However, such knowledge is lacking in AI systems, restricting their ability to adapt to atypical problem data. Previous studies show that pre-trained language models improve performance on various natural language understanding and generation tasks. In recent work, researchers at Baidu, in collaboration with Peng Cheng Laboratory (PCL), released PCL-BAIDU Wenxin (or “ERNIE 3.0 Titan”), a pre-trained language model with 260 billion parameters. It is the world’s first knowledge-enhanced multi-hundred-billion-parameter model and the largest Chinese singleton model. You can read the short summary here: https://www.marktechpost.com/2021/12/29/baidu-and-pcl-team-introduce-ernie-3-0-titan-a-pre-training-language-model-with-260-billion-parameters/ Paper: https://arxiv.org/pdf/2112.12731.pdf https://preview.redd.it/19urwzn7cj881.png?width=1920&format=png&auto=webp&s=6a97e19e9fc4f5dde161e4a6ee17e3b43d86cc39 submitted by /u/ai-lover [link] [comments]
  • Reinforcement Learning

    Favorite papers from 2021
    (2 min) What have been your favorite reads of 2021 in terms of RL papers? I will start! Reward is enough (Reddit Discussion) - Four great names from RL (Silver, Singh, Precup and Sutton) give their reasoning as to why RL could create superintelligence. You might not agree with it, but it's interesting to see the standpoint of DeepMind and where they want to take RL. Deep Reinforcement Learning at the Edge of the Statistical Precipice (Reddit Discussion) - This is a major step towards better model comparison in RL. Too many papers in the past have used a selection technique akin to 'average the top 30 runs out of a total of 100'. I had also never heard of Munchausen RL before this paper, and was pleasantly surprised by reading it. Mastering Atari with Discrete World Models - Very good read and a nice path from Ha's World Models to Dream to Control to DreamerV2. This is one of the methods this year that actually seems to improve performance quite a bit without needing a large-scale distributed approach. On the Expressivity of Markov Reward (Reddit Discussion) - The last sentence in the blog post captures it for me: "We hope this work provides new conceptual perspectives on reward and its place in reinforcement learning"; it did. Open-Ended Learning Leads to Generally Capable Agents (Reddit Discussion) - Great to see the environment integrated into the learning process; seems like something we will see much more of in the future. Unfortunately, as DeepMind does, neither the environment nor the code is released. I remember positions at OpenAI for open-ended learning, so perhaps we might see something next year to compete with this. Most of my picks are not practical algorithms. For me, it seems like PPO is still king when looking at performance and simplicity, which is kind of a disappointment. I probably missed some papers too. What was your favorite RL paper of 2021? Was it Player of Games (why?), something with offline RL, or perhaps multi-agent work? submitted by /u/YouAgainShmidhoobuh [link] [comments]
    What Happened to OpenAI + RL?
    (2 min) OpenAI used to do a lot of RL research, but it seems like last year and this year the only real RL-related work was on benchmark competitions. They even gave away control of OpenAI Gym. They still have great RL researchers working there, but nothing major has come out. Is it all due to a pivot towards large-scale language models that are at least profitable? Is Sam Altman just not interested in RL? submitted by /u/YouAgainShmidhoobuh [link] [comments]

2021-12-28

  • The Official NVIDIA Blog

    It Was a Really Virtual Year: Top Five NVIDIA Videos of 2021
    (3 min) What better way to look back at NVIDIA’s top five videos of 2021 than to hop into the cockpit of a virtual plane flying over Taipei. That was how NVIDIA’s Jeff Fisher and Manuvir Das invited viewers into their COMPUTEX keynote on May 31. Their aircraft sailed over the city’s green hills and banked around Read article > The post It Was a Really Virtual Year: Top Five NVIDIA Videos of 2021 appeared first on The Official NVIDIA Blog.
  • Becoming Human: Artificial Intelligence Magazine - Medium

    Keys of Deep Learning : Activation Functions
    (13 min) Biological inspiration of Neural Networks
  • John D. Cook

    Finding Lagrange points L1 and L2
    (2 min) The James Webb Space Telescope (JWST) is on its way to the Lagrange point L2 of the Sun-Earth system. Objects in this location will orbit the Sun at a fixed distance from Earth. There are five Lagrange points, L1 through L5. The points L1, L2, and L3 are unstable, and points L4 and L5 are […] Finding Lagrange points L1 and L2 first appeared on John D. Cook.
    Most popular posts of 2021
    (1 min) These were the most popular posts from 2021 on this site, listed in chronological order. Simple gamma approximation Floor, ceiling, bracket Evolution of random number generators Format text in twitter Where has all the productivity gone? Computing zeta(3) Morse code palindromes Much less than, much greater than Monads and macros Partial functions Most popular posts of 2021 first appeared on John D. Cook.
  • Neural Networks, Deep Learning and Machine Learning

    NeurIPS 2021 - Curated papers - Part 2
    (0 min) In part 2, I have discussed the following papers: Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training; Attention Bottlenecks for Multimodal Fusion; AugMax: Adversarial Composition of Random Augmentations for Robust Training; Revisiting Model Stitching to Compare Neural Representations. https://rakshithv-deeplearning.blogspot.com/2021/12/neurips-2021-curated-papers-part2.html submitted by /u/rakshith291 [link] [comments]
    Autoencoders for CIFAR-10
    (0 min) Most autoencoder examples/blogs use the MNIST dataset for their implementation. I have trained an autoencoder on CIFAR-10, which you can refer to here. There is a trade-off between making the CNN architecture deeper and the improvement in reconstruction loss/error. Or, use a VAE. Thoughts? submitted by /u/grid_world [link] [comments]
    Researchers Compare Deep Learning (DL) Algorithms For Diagnosing Bacterial Keratitis via External Eye Photographs
    (1 min) Viral, bacterial, fungal, and parasitic keratitis are all types of infectious keratitis. Bacterial Keratitis (BK) is one of the most frequent and vision-threatening kinds. Contact lens wear is the most prevalent risk factor for BK, and it is becoming increasingly popular around the world for a variety of reasons, including exercise, cosmesis, and myopia management. BK has a substantially more fulminant and painful clinical course than other forms of infectious keratitis. A delayed diagnosis of BK can result in large-area corneal ulcerations, melting, and even perforation if not treated. Timely detection and treatment of BK are vital goals. However,…

2021-12-27

  • Featured Blog Posts - Data Science Central

    Are Data Meshes Really Data Marts with Conformed Dimensions?
    (6 min) Okay, so I am still confused by the concept of a Data Mesh (I’m a slow learner). Recently, I wrote a blog that posed the question: Are Data Meshes the Enabler of the Marginal Propensity to Reuse? The ability to…
    Optimal Scraping Technique: CSS Selector, XPath, & RegEx
    (6 min) Web scraping deals with HTML almost exclusively. In nearly all cases, what is required is a small sample from a very large file (e.g. pricing information from an ecommerce page). Therefore, an essential part of scraping is searching through an HTML document and finding the correct…
    What is a Data Management Platform?
    (4 min) Introduction With the explosion of digital media and the resultant avalanche of constantly increasing user data being created and collected daily, businesses, institutions, and organizations need better ways to manage that data beyond the standard suite of tools. The description of a Data Management Platform or Augmented Data…
  • Becoming Human: Artificial Intelligence Magazine - Medium

    How Artificial Intelligence is Changing the Payment Gateway Industry
    (5 min) The world is in an exciting phase of artificial intelligence that is slowly taking over our daily lives. Alexa, Siri is replacing person
    6 Best Online Courses to learn Computer Vision and OpenCV for Beginners
    (7 min) My favorite online courses, projects, and Computer Vision certification for beginners to learn Computer vision and OpenCV in 2022
  • John D. Cook

    The exception case is normal
    (1 min) Sine and cosine have power series with simple coefficients, but tangent does not. Students are shocked when they see the power series for tangent because there is no simple expression for the power series coefficients unless you introduce Bernoulli numbers, and there’s no simple expression for Bernoulli numbers. The perception is that sine and cosine […] The exception case is normal first appeared on John D. Cook.
    Binomial sums and powers of 2
    (2 min) Marcel Golay noticed that C(23,0) + C(23,1) + C(23,2) + C(23,3) = 2^11 and realized that this suggested it might be possible to create a perfect code of length 23. I wrote a Twitter thread on this that explains how this relates to coding theory (and to sporadic simple groups). This made me wonder how common it is for cumulative sums of binomial coefficients […] Binomial sums and powers of 2 first appeared on John D. Cook.
  • Neural Networks, Deep Learning and Machine Learning

    Creating an optimization algorithm for cost function for NN
    (0 min) Is it possible to find an article or an example of a new optimization algorithm for a neural network's cost function? submitted by /u/adilkolakovic [link] [comments]
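    As a starting point, the usual exercise is to write a known update rule by hand as a torch.optim.Optimizer subclass and then modify it; a minimal sketch of plain SGD:

        import torch

        class MySGD(torch.optim.Optimizer):
            def __init__(self, params, lr=0.01):
                super().__init__(params, defaults={"lr": lr})

            @torch.no_grad()
            def step(self):
                for group in self.param_groups:
                    for p in group["params"]:
                        if p.grad is not None:
                            p.add_(p.grad, alpha=-group["lr"])  # w <- w - lr * grad

        # usage: opt = MySGD(model.parameters(), lr=0.1); loss.backward(); opt.step()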
    HyperNeRF: A Higher-Dimensional Representation for Topologically Varying Neural Radiance Fields
    (0 min) submitted by /u/nickb [link] [comments]

2021-12-26

  • Featured Blog Posts - Data Science Central

    The web 3 meme
    (4 min) Late in the year, we are suddenly hearing a new term, ‘web 3’. For the reasons I describe below, this trend will be significant. In this article, I take the perspective of a neutral…

2021-12-25

  • Neural Networks, Deep Learning and Machine Learning

    Getting familiar with neural networks
    (1 min) Any good beginner books on the theory of machine learning systems / DNNs? submitted by /u/Abeokuta_ [link] [comments]

2021-12-24

  • Featured Blog Posts - Data Science Central

    Machine Learning can give a 10 second Turbulence Warning
    (5 min) Thousands of people are injured by turbulence every year. A new machine learning model gives a high-accuracy, 10-second warning for turbulence. The model may lessen in-flight injuries and save lives. Turbulence is one of the leading causes of injuries on passenger planes and—if you don’t have your seat belt on—those…
  • Damian Bogunowicz - dtransposed

    Learning by Doing - the DeFi Quest (Part 2 out of 2)
    (10 min) Christmas break is a great time to catch up with the backlog of interesting things to learn and read about. I used some of this time to finish the amazing DeFi quest by Cristian Strat. This is a continuation to the first part of this series, where I document my first steps in the world of Decentralized Finance. If you have not read the first write-up, please do. Otherwise, this text will be very confusing and not too useful for you. But once you go through both blog posts, I guarantee that you will have a DeFi knowledge superior to the majority of the folks out there. Adventure IV This adventure will be more like a short pause from more in-depth topics. I will quickly present some of the best DeFi dashboards out there. A DeFi dashboard is a central place, where you can view your assets, tr…
  • John D. Cook

    O Come, O Come, Emmanuel: condensing seven hymns into one
    (1 min) The Christmas hymn “O Come, O Come, Emmanuel” is a summary of the seven “O Antiphons,” sung on December 17 through 23, dating back to the 8th century [1]. The seven antiphons are O Sapientia, O Adonai, O Radix Jesse, O Clavis David, O Oriens, O Rex Gentium, O Emmanuel. The corresponding verses of “O […] O Come, O Come, Emmanuel: condensing seven hymns into one first appeared on John D. Cook.
  • Neural Networks, Deep Learning and Machine Learning

    Artificial Vision with Neocognitron by Kunihiko Fukushima (the father of convolutional neural networks)
    (1 min) submitted by /u/Wild-Dig-8003 [link] [comments]

2021-12-23

  • Featured Blog Posts - Data Science Central

    How to Find a Mobile App Development Firm: Tips for Businesses
    (4 min) Let’s imagine that you have signed a contract with a mobile app development firm. It seems that you have agreed upon everything: terms, budget, the scope of work, and other things. But six months later, it turns out that the developers are not up to their work and the product is…
    What is a Scrum Board? What is the Difference Between a Scrum & Kanban Board?
    (5 min) What is a Scrum Board? As Scrum is one of the popular frameworks to break down complex problems into smaller tasks, Scrum board is a project management software used to visually represent these tasks and Scrum sprints. The scrum board is the center of every sprint meeting to get regular updates…
    Cloud, Cost and Containers three C's in Cloud-Computing in post-COVID
    (6 min) The COVID-19 pandemic has thrown organizations into disarray—the increased usage of conferencing and collaboration services by employees working from home strains back-end support services.  And growing traffic on networks connecting…
  • The Official NVIDIA Blog

    Have a Holly, Jolly Gaming Season on GeForce NOW
    (3 min) Happy holidays, members. This GFN Thursday is packed with winter sales for several games streaming on GeForce NOW, as well as seasonal in-game events. Plus, for those needing a last minute gift for a gamer in their lives, we’ve got you covered with digital gift cards for Priority memberships. To top it all off, six Read article > The post Have a Holly, Jolly Gaming Season on GeForce NOW appeared first on The Official NVIDIA Blog.
  • Blog – Machine Learning Mastery

    Understanding Traceback in Python
    (13 min) When an exception occurs in a Python program, a traceback will often be printed. Knowing how to read the traceback […] The post Understanding Traceback in Python appeared first on Machine Learning Mastery.
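    A tiny example of the kind of chain the post explains, read from the bottom up (stdlib only):

        import traceback

        def inner():
            return 1 / 0       # the actual failure point (bottom frame)

        def outer():
            return inner()     # each frame above shows how execution got there

        try:
            outer()
        except ZeroDivisionError:
            traceback.print_exc()  # prints the same chain an unhandled crash would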
  • AWS Machine Learning Blog

    Introducing hybrid machine learning
    (8 min) Gartner predicts that by the end of 2024, 75% of enterprises will shift from piloting to operationalizing artificial intelligence (AI), and the vast majority of workloads will end up in the cloud in the long run. For some enterprises that plan to migrate to the cloud, the complexity, magnitude, and length of migrations may be […]
    Use deep learning frameworks natively in Amazon SageMaker Processing
    (7 min) Until recently, customers who wanted to use a deep learning (DL) framework with Amazon SageMaker Processing faced increased complexity compared to those using scikit-learn or Apache Spark. This post shows you how SageMaker Processing has simplified running machine learning (ML) preprocessing and postprocessing tasks with popular frameworks such as PyTorch, TensorFlow, Hugging Face, MXNet, and […]
  • Neural Networks, Deep Learning and Machine Learning

    Meta AI Announces the Beta Release of ‘Bean Machine’: A PyTorch-Based Probabilistic Programming System Used to Understand the Uncertainty in the Machine Learning Models
    (1 min) Meta AI releases the beta version of Bean Machine, a probabilistic programming framework based on PyTorch that makes it simple to describe and learn about uncertainty in machine learning models used in various applications. Bean Machine makes it possible to create domain-specific probabilistic models, and it uses multiple uncertainty-aware learning algorithms to learn about a model’s unseen features. Quick Read: https://www.marktechpost.com/2021/12/23/meta-ai-announces-the-beta-release-of-bean-machine-a-pytorch-based-probabilistic-programming-system-used-to-understand-the-uncertainty-in-the-machine-learning-models/ Documentation: https://beanmachine.org/ Tutorials: https://beanmachine.org/docs/tutorials/ Meta Blog: https://research.facebook.com/blog/2021/12/introducing-bean-machine-a-probabilistic-programming-platform-built-on-pytorch/ https://preview.redd.it/xzovhjrzvb781.png?width=1920&format=png&auto=webp&s=1337f3b0f646fd6d888ce02363eb63a6d5257fb7 submitted by /u/ai-lover [link] [comments]
    If the accuracy of my network is zero on the very first epoch, is there any point in letting the training run for the remaining epochs?
    (1 min)

2021-12-22

  • Featured Blog Posts - Data Science Central

    Scraping Data from Google Search Using Python and Scrapy
    (10 min) Scraping Google SERPs (search engine result pages) is as straightforward or as complicated as the tools we use. For this tutorial, we’ll be using Scrapy, a web scraping framework designed for Python. Python and Scrapy combine to create a powerful duo that we can use to scrape almost any website. Scrapy has many useful…
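    To give a sense of how little code a spider needs, here is a hypothetical sketch (not the tutorial's code): the CSS selectors are assumptions about SERP markup, which changes often, and note that Google actively blocks automated scraping.

      import scrapy

      class SerpSpider(scrapy.Spider):
          name = "serp"
          start_urls = ["https://www.google.com/search?q=web+scraping"]

          def parse(self, response):
              # "div.g" and "h3" are illustrative guesses at result markup.
              for result in response.css("div.g"):
                  yield {
                      "title": result.css("h3::text").get(),
                      "link": result.css("a::attr(href)").get(),
                  }

    Saved as serp_spider.py, this could be run with: scrapy runspider serp_spider.py -o results.json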
    3D Secure Authentication: A New Era For Payment Authentication And Customer Experience
    (4 min) 3D Secure authentication is a fraud-prevention security system for credit and debit card transactions processed online. It adds an extra layer of security during card payments, giving clients a safe authentication step before…
    Using Cloud and AI Technologies to Make Data Driven Decisions For Monetization
    (4 min) Data monetization is the process of earning income or creating new revenue streams by utilizing data, often through the exchange of data between corporations; this is estimated to support expansion of the global data monetization market. Data monetization takes two forms: direct and indirect. The sale of…
    Outsourcing Data Annotation Work
    (4 min) The discipline of data annotation and labeling is growing in popularity and significance throughout the world. The global market for data annotation tools is expected to reach $2.57 billion by 2027, according to a published report. For robots, drones, and vehicles to gain increasing levels of autonomy,…
    Top 7 Reasons Data Scientists Should Know Java Programming Language
    (5 min) Java is the #1 programming language for Big Data, Analytics, DevOps, and AI. It is consistently the first choice for…
    DSC Weekly Digest 21 December 2021: Winter Solstice Shenanigans
    (8 min) Today, the 21st of December, is the Winter Solstice, unless you happen to be in sub-Saharan Africa, Australia, Argentina, or Antarctica, or other points south of the equator. In that case, today is the first day…
  • AI Weirdness

    Christmas entities
    (2 min) When you think about it, Christmas can get pretty weird. There's the classic Christmas story of the Bible, and then there are all these extra entities that aren't in the book but which become somehow part of Christmas. And some of them are quite unsettling. There…
    Bonus: The awesome lightning power of Blitzen, son of Donder
    (1 min) AI Weirdness: the strange side of machine learning
  • Becoming Human: Artificial Intelligence Magazine - Medium

    Add Spark to Your Customer Notifications with a Bentech’s Modern CCM
    (3 min) Do you keep track of the number of notifications and alerts you receive each day? As customers, we receive numerous notifications from our…
    Strategies to Reimagine Consumer’s Digital Trip in a New Age
    (4 min) Once-normal routines at smartphone shops, automobile dealerships, and restaurants: many of these customer journeys have been transformed into…
  • Seita's Place

    My Information Diet
    (5 min) On July 03, 2021, the subject of media and news sources came up in a brunch conversation about media bias. I was asked: “what news do you read?” I regret that I gave a sloppy response that sounded like a worse version of: “uh, I read a variety of news …” and then I tried listing a few from memory. I wish I had given a crisper response, and I have thought about that question every day since. In this blog post, I describe my information diet: how I read and consume media to understand current events. Before getting to the actual list of media sources, here are a few comments to clarify my philosophy, which might also preemptively address common objections. There are too many sources and not enough time to r…
  • Neural Networks, Deep Learning and Machine Learning

    Face recognition
    (1 min) I am currently working on a project that involves face verification. I want to use the Azure Face API, and I have to say I haven’t understood the pricing; I would love it if some of you could clarify. Say I have 200 faces and 2,000 pictures these faces appear in (1 face might be in N pictures). In order to find all the pictures each face is in, does that mean I will have to run 400,000 transactions, as in 2,000 iterations per face? Or is there a smarter way to do it? I know there is an option to index faces, but I do not fully understand that yet. Thank you for your help!
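    The arithmetic behind the question is worth sketching: the usual trick with face indexing is to detect faces in each picture once and then identify the detected faces against a pre-built index, rather than verifying every face-picture pair. The call counts below are a back-of-the-envelope Python sketch, not Azure's actual billing model.

      # Back-of-the-envelope call counts for the scenario in the post.
      faces, pictures = 200, 2000

      # Naive: verify every (face, picture) pair individually.
      pairwise_calls = faces * pictures          # 400,000 transactions

      # Indexed: one detect call per picture, then one identify call per
      # picture to match its detected faces against the indexed group.
      indexed_calls = pictures + pictures        # 4,000 transactions

      print(pairwise_calls, indexed_calls)       # 400000 4000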
    Managing Collaborative Machine Learning Experiments - Guide
    (1 min) Sharing machine learning experiments and comparing their models is important when you're working with a team of engineers. You might need a second opinion on an experiment's results, to share a modified dataset, or to share an exact reproduction of a specific experiment. The following tutorial walks through an example of sharing an experiment with DVC remotes: Running Collaborative Experiments - using DVC remotes to share experiments and their data across machines. Setting up DVC remotes in addition to your Git remotes lets you share all of the data, code, and hyperparameters associated with each experiment, so anyone can pick up where you left off in the training process. When you use DVC, you can bundle your data and code changes for each experiment and push them to a remote for somebody else to check out.
    Noob needs help
    (1 min) OK, so I’m new to neural networks, so I might not understand a lot of the technical terms. I want to show a NN some gameplay, but I don’t know how to. I can capture the keystrokes for each frame, but I don’t know what to do next. Do I take screenshots and save those frames with the keystrokes to my disk? I kind of know how to use cv2, Pillow, and the keyboard module. I’m really lost, so any help will be highly appreciated. PS: I want the model to learn how to play the game.
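    One common starting point is exactly what the poster guesses: save a screenshot per frame alongside the keys held at that instant. A rough sketch using the Pillow and keyboard packages the post mentions; the key list, frame count, frame rate, and file layout are all arbitrary choices.

      import csv
      import os
      import time

      import keyboard                  # pip install keyboard
      from PIL import ImageGrab        # pip install Pillow

      KEYS = ["w", "a", "s", "d"]      # keys to record; adjust per game
      os.makedirs("frames", exist_ok=True)

      with open("labels.csv", "w", newline="") as f:
          writer = csv.writer(f)
          writer.writerow(["frame"] + KEYS)
          for i in range(1000):                        # ~33 s at 30 fps
              ImageGrab.grab().save(f"frames/{i:05d}.png")   # screenshot
              pressed = [int(keyboard.is_pressed(k)) for k in KEYS]
              writer.writerow([i] + pressed)           # frame + key labels
              time.sleep(1 / 30)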

2021-12-21

  • The Official NVIDIA Blog

    3D Artist Turns Hobby Into Career, Using Omniverse to Turn Sketches Into Masterpieces
    (3 min) It was memories of playing Pac-Man and Super Mario Bros while growing up in Colombia’s sprawling capital of Bogotá that inspired Yenifer Macias’s award-winning submission for the #CreateYourRetroverse contest, featured above. The contest asked NVIDIA Omniverse users to share scenes that visualize where their love for graphics began. For Macias, that passion goes back to…
    NVIDIA BlueField Sets New World Record for DPU Performance
    (4 min) Data centers need extremely fast storage access, and no DPU is faster than NVIDIA’s BlueField-2. Recent testing by NVIDIA shows that two BlueField-2 data processing units reached 41.5 million input/output operations per second (IOPS), more than 4x the IOPS of any other DPU. The BlueField-2 DPU delivered record-breaking performance using standard networking protocols and…
    How Omniverse Wove a Real CEO — and His Toy Counterpart — Together With Stunning Demos at GTC
    (5 min) It could only happen in NVIDIA Omniverse — the company’s virtual world simulation and collaboration platform for 3D workflows. And it happened during an interview with a virtual toy model of NVIDIA’s CEO, Jensen Huang. “What are the greatest …” one of Toy Jensen’s creators asked, stumbling, then stopping before completing his scripted question. Unfazed,…
  • Becoming Human: Artificial Intelligence Magazine - Medium

    Transfer Learning — Part — 5.1!! Implementing ResNet in Keras
    (113 min) In Part 5.0 of the Transfer Learning series we discussed the ResNet pre-trained model in depth, so in this part we will implement…
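    The standard recipe such a walkthrough builds toward looks roughly like this: a minimal sketch using Keras's bundled ResNet50 as a frozen feature extractor. The head layers and the 10-class output are illustrative, not the article's exact code.

      import tensorflow as tf
      from tensorflow.keras import layers, models
      from tensorflow.keras.applications import ResNet50

      # Pre-trained ResNet50 backbone with its classifier removed.
      base = ResNet50(weights="imagenet", include_top=False,
                      input_shape=(224, 224, 3))
      base.trainable = False           # freeze pre-trained weights

      # New classification head on top (sizes are illustrative).
      model = models.Sequential([
          base,
          layers.GlobalAveragePooling2D(),
          layers.Dense(256, activation="relu"),
          layers.Dense(10, activation="softmax"),
      ])
      model.compile(optimizer="adam",
                    loss="categorical_crossentropy",
                    metrics=["accuracy"])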
    5 Ways To Use AI For Supply Chain Management
    (7 min) Optimizing Our Supply Chain Using Artificial Intelligence
2022-01-20T00:41:55.723Z osmosfeed 1.12.0